MESSAGE FROM THE GENERAL CHAIR


Welcome  to  the  9th  Heterogeneous  Computing  Workshop  (HCW  2000).  The  field  of 
heterogeneous  computing  continues  to  mature,  with  several  focused  themes  evolving  over  the 
past  few  years.  These  include  cluster  computing,  grid  computing,  and  metacomputing,  among 
others.  The  Heterogeneous  Computing  Workshop  series  offers  an  international  forum  for 
researchers  in  all  of  these  overlapping  areas  to  present  their  research  findings  and  interact  with 
their  peers. 

I  would  like  to  thank  H.J.  Siegel,  the  Steering  Committee  Chair,  for  inviting  me  to  be  the 
General Chair. Throughout the past year, he provided me with invaluable input in resolving
meeting-related  issues.  I  also  would  like  to  thank  C.S.  Raghavendra,  the  Program  Chair.  He  did 
an  outstanding  job  in  putting  together  an  excellent  technical  program  that  addresses  diverse 
aspects  of  heterogeneous  computing.  In  addition,  he  assisted  in  resolving  meeting-related  issues 
including  planning  and  publicity.  It  was  a  pleasure  working  with  H.J.  and  Raghu. 

This  year,  the  response  to  the  call  for  papers  was  overwhelming.  For  the  first  time,  we  had  to 
arrange  parallel  sessions  to  accommodate  so  many  excellent  papers  that  were  submitted!  I  would 
like to thank Susamma Barua, IPDPS 2000 Local Arrangements Chair, for her assistance in
arranging  the  meeting  space  for  us. 

The  workshop  is  cosponsored  by  the  US  Office  of  Naval  Research  and  the  IEEE  Computer 
Society  Technical  Committee  on  Parallel  Processing.  I  would  like  to  thank  Richard  Freund  of 
NOEMIX, Inc., for his continued support and guidance of the meeting series.

Muthucumaru  Maheswaran  acted  as  the  Publicity  Chair.  I  would  like  to  thank  him  for  the 
excellent  job  he  did  in  maintaining  our  website  as  well  as  publicizing  the  meeting.  Denise 
Williams  of  the  IEEE  Computer  Society  Press  deserves  special  mention  for  her  efforts  in  putting 
together  the  proceedings.  Finally,  I  would  like  to  thank  my  assistant  Henryk  Chrostek  for 
coordinating meeting-related interactions over the past year. It was a pleasure to work with all of
them. 


Viktor  K.  Prasanna 

University  of  Southern  California 
MESSAGE FROM THE PROGRAM CHAIR

It  has  been  a  pleasure  to  organize  the  9th  Heterogeneous  Computing  Workshop  (HCW  2000).  This 
workshop  is  a  forum  to  discuss  the  latest  findings  in  heterogeneous  computing  and  promising  work-in- 
progress.  Heterogeneous  computing  systems  range  from  diverse  elements  within  a  single  computer,  to 
coordinated,  geographically-distributed  machines  with  different  architectures.  A  heterogeneous 
computing  system  provides  a  variety  of  capabilities  that  can  be  orchestrated  to  execute  multiple  tasks  with 
varied  computational  requirements.  Applications  in  these  environments  achieve  performance  by 
exploiting  the  affinity  of  different  tasks  to  different  computational  platforms  or  paradigms,  while 
considering  the  overhead  of  inter-task  communication  and  the  coordination  of  distinct  data  sources  and 
administrative domains. Such computing systems support information infrastructure; other terms,
including Cluster Computing and Grid Computing, are also used to describe heterogeneous computing.

In  prior  years,  the  HCW  program  had  a  single  track  of  paper  presentations  with  an  invited  paper 
session.  This  year  we  received  many  quality  papers,  and  with  the  surveyed  opinions  of  authors,  we 
decided  to  have  two  parallel  tracks.  The  technical  program  includes  32  papers  arranged  in  10  sessions 
along the two parallel tracks. Each of the submitted papers was reviewed by two program committee
members  and  external  reviewers.  These  presentations  cover  a  range  of  heterogeneous  computing  topics, 
including  grid  computing  and  applications,  scheduling  algorithms,  theory  and  modeling,  and  resource 
management.  I  would  like  to  thank  the  members  of  the  Technical  Program  Committee  for  their  valuable 
and  timely  review  work  and  their  help  in  putting  together  the  program. 

The  success  of  a  workshop  such  as  this  depends  on  the  contributions  of  many  individuals.  First,  I 
would  like  to  thank  Viktor  Prasanna,  the  General  Chair,  for  inviting  me  to  be  the  Program  Chair  and  for 
providing  me  with  a  number  of  pointers  on  organizational  matters.  Next,  I  would  like  to  thank  H.J.  Siegel, 
the  Steering  Committee  Chair,  for  his  continued  support  on  resolving  meeting-related  issues  and  for  his 
help  in  getting  the  financial  support  for  publishing  the  workshop  proceedings.  I  also  would  like  to  thank 
Muthucumaru  Maheswaran  for  publicizing  this  workshop  through  various  on-line  mailing  lists  and 
postings  on  the  Web.  I  am  thankful  to  Henryk  Chrostek  and  Ammar  Alhusaini  for  their  help  in  handling 
submitted  papers  and  in  the  organization  of  the  Program  Committee  meeting.  Finally,  on  behalf  of  the 
Program  Committee,  I  would  like  to  extend  my  gratitude  to  the  authors,  session  chairpersons,  and  the 
reviewers  who  contributed  to  making  the  9th  Heterogeneous  Computing  Workshop  a  success. 

Cauligi  S.  Raghavendra 

The  Aerospace  Corporation 
MESSAGE FROM THE STEERING COMMITTEE CHAIR

These  are  the  proceedings  of  the  9th  Heterogeneous  Computing  Workshop,  also  known  as  HCW  2000. 
Heterogeneous  computing  is  a  very  important  research  area  with  great  practical  impact.  The  topic  of 
heterogeneous systems covers many types of systems. A heterogeneous system may be a set of machines
interconnected  by  a  wide-area  network  and  used  to  support  the  execution  of  jobs  submitted  by  a  large  variety 
of  users  to  process  data  that  is  distributed  throughout  the  system.  A  heterogeneous  system  may  be  a  suite  of 
high-performance  machines  tightly  interconnected  by  a  fast  dedicated  local-area  network  and  used  to  process  a 
set  of  production  tasks,  where  the  subtasks  of  each  task  may  execute  on  different  machines  in  the  suite.  A 
heterogeneous  system  may  also  be  a  special-purpose  embedded  system,  such  as  a  set  of  different  types  of 
processors  used  for  automatic  target  recognition.  In  the  extreme,  a  heterogeneous  system  may  consist  of  a 
single  machine  that  can  reconfigure  itself  to  operate  in  different  ways  (e.g.,  in  different  modes  of  parallelism). 
All  of  these  types  of  heterogeneous  systems  (as  well  as  others)  are  appropriate  topics  for  this  workshop  series. 

I  hope  you  find  the  contents  of  these  proceedings  informative  and  interesting.  I  encourage  you  to  look  also  at 
the  proceedings  of  past  and  future  Heterogeneous  Computing  Workshops. 

Many  people  have  worked  very  hard  to  make  this  workshop  happen.  Cauligi  “Raghu”  Raghavendra,  of  the 
University  of  Southern  California  and  The  Aerospace  Corporation,  was  this  year’s  Program  Committee  Chair, 
and  he  assembled  the  excellent  program  and  collection  of  papers  in  these  proceedings.  Raghu  did  this  with  the 
assistance  of  his  Program  Committee,  which  is  listed  in  these  proceedings.  Viktor  Prasanna,  of  the  University 
of  Southern  California,  was  the  General  Chair,  and  he  was  responsible  for  the  overall  organization  and 
administration  of  this  year’s  workshop,  and  he  did  a  fine  job.  I  thank  Richard  F.  Freund,  of  NOEMIX,  for 
founding  this  workshop  series,  and  for  asking  me  to  succeed  him  as  Chair  of  the  Steering  Committee. 

Due  to  the  increasing  importance  of  this  research  area  and  the  efforts  of  the  workshop  organizing 
committee,  we  received  so  many  excellent  submissions  this  year  that  we  had  to  go  to  parallel  sessions  for  the 
first  time.  While  we  realize  this  splits  the  audience,  we  did  not  want  to  turn  away  good  papers  simply  because 
this  is  a  one-day  workshop  (and  we  did  not  want  to  extend  the  workshop  another  day,  conflicting  with  our  host 
symposium’s  sessions). 

This  year  IEEE  Computer  Society  and  the  Office  of  Naval  Research  (ONR)  cosponsored  the  workshop, 
with  additional  support  given  from  our  industrial  affiliate,  NOEMIX.  I  thank  Andre  M.  van  Tilborg,  the 
Director  of  the  Math,  Computer,  &  Information  Sciences  Division  of  the  Office  of  Naval  Research,  for 
arranging funding for the publication of the workshop proceedings (under grant number N00014-00-1-0189).
We  greatly  appreciate  their  continued  support  of  our  proceedings.  I  thank  Richard  F.  Freund,  of  NOEMIX,  for 
again  providing  the  plaque  given  to  the  Program  Chair  in  recognition  of  his  efforts. 

This  workshop  is  held  in  conjunction  with  the  International  Parallel  and  Distributed  Processing 
Abstract

Resource selection is fundamental to the performance of master/slave applications. In this
paper, we address the problem of promoting performance for distributed master/slave
applications targeted to distributed, heterogeneous "Grid" resources. We present a
work-rate-based model of master/slave application performance which utilizes both system and
application characteristics to select potentially performance-efficient hosts for both the
master and slave processes. Using a Grid allocation strategy based on this performance model,
we demonstrate a performance improvement over other selection options for a representative
set of master/slave applications in both simulated and actual Grid environments.


1.  Introduction 

The master/slave paradigm is a fundamental and commonly used approach for parallel and
distributed applications. In master/slave applications, a single master process controls the
distribution of work to a set of identically operating slave processes. The master/slave
paradigm has been used successfully for a wide class of parallel applications [12][6][14],
and is well suited as a programming model for applications targeted to distributed,
heterogeneous "Grid" resources [1].

Methods which can improve the performance of master/slave applications are of considerable
interest to many people. Researchers and application developers have previously experimented
with tuning the granularity of master and slave processes to balance computation and
communication, varying parameters such as the number and complexity of tasks assigned to
slaves, and varying the number of slave processes used [3][8][16]. Note that in a homogeneous
environment, any processor can reasonably be chosen as a master or a slave, as all resources
are typically considered to be equivalent. However, in a heterogeneous Grid environment,
non-uniformity in both the peak and deliverable capacities of computational and communication
resources can produce very different application execution times depending on which processor
is chosen for the master and which processors are chosen for the slaves.

In this paper, we address the problem of how to determine a performance-efficient placement
of master and slave processes running in shared, distributed and heterogeneous environments.
In a heterogeneous environment, the choice of processor for the master can have a significant
effect on total available work rate, directly impacting application performance. Our strategy
for selecting a location for the master process involves identifying the host processor which
allows for the largest aggregated system work rate, which we will define in the next section.
Our strategy for selecting slaves utilizes the performance capacity of the available
computation and communication resources to determine a performance-efficient collection of
workers.

This paper is organized as follows: Section 2 provides a performance model for distributed
master/slave applications. Section 3 describes how we obtain and use input parameters for
calculating resource work capacity values in our performance models. Section 4 describes our
algorithm for selecting the resources to use for the master and slave processes. Section 5
gives a representative set of performance results from our experiments, Section 6 includes a
short discussion of some related work, and Section 7 provides a summary of our work.

2.  A  master/slave  performance  model 

We consider a model of master/slave applications in which the primary function of the master
process m is to pass out and collect work from a set of slave processes $s \in S$.¹ We assume
that communication patterns are simple and well-defined, requiring communication only between
the master process and individual slave processes. We will define the application's work as a
divisible set of tasks, where each task may require some input data and produces some output
data.

Tasks  are  completed  in  an  application  by  progressing 
through  four  stages  in  the  master/slave  computation: 

Stage  1  is  the  transmission  of  a  command  to  initiate  a 
task  on  one  of  the  slave  processes,  including  any  data 
needed  by  the  slave  to  perform  the  computation. 

Stage  2  is  the  execution  of  the  task  by  the  designated  slave. 

Stage  3  is  the  transmission  of  results  from  the  slave  back 
to  the  master. 

Stage  4  is  any  immediate  processing  of  task  results  from 
the  slave  that  must  be  done  by  the  master. 

While passing through each stage in the computation, a particular system resource must be
employed by a task for some period of time, after which the task can move on to the next
stage. As an example, we can consider the simple network topology shown in Figure 1. If
processor A in Figure 1 is designated as the master process, a task intended for slave
processor B during Stage 1 will employ the use of network Net1 to transfer required data from
processor A to processor B. During Stage 2 the task will utilize processor time on B to run
task computations. During Stage 3 the task will again utilize network Net1 to transfer result
data from B to A. Finally, during Stage 4 the task will utilize processor time on A to process
the incoming results and to prepare for initiating additional task transfers to B.

¹ It would be straightforward to extend this work to the case in which
the master may also perform some work as a slave.


Figure 1. Example network configuration.

In constructing a performance model for master/slave applications, we look at the rate that
applications process tasks. The rate at which an application cycles through tasks can be used
as a measure of application performance, as faster overall cycle rates will correspond
directly to reduced application execution times.

If we consider the flow of tasks between a master process m and a slave process s, we can
make the definition:

SlaveRate(m, s) is the task completion rate occurring between master m and slave s, in units
of tasks per unit of time.

For master/slave computations where there is no communication between different slave
processes, the total rate of task completions for an application will be the sum of the rates
arising from task completions by individual slaves. We define AppRate(m, S) in Equation (1)
to be the rate of task completions by an application with master process m and a set of slave
processes S.

$$\mathrm{AppRate}(m, S) = \sum_{s \in S} \mathrm{SlaveRate}(m, s) \qquad (1)$$

We can then define execution time, ExecTime(m, S), for an application with a master process m
and the set of slave processes S, and where Tasks is the total number of tasks in the
application.

$$\mathrm{ExecTime}(m, S) = \mathrm{Tasks} / \mathrm{AppRate}(m, S) \qquad (2)$$
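
As a minimal sketch of Equations (1) and (2), the following Python computes the aggregate rate and the resulting execution time. The function names `app_rate` and `exec_time` and the per-slave rates are hypothetical, chosen only for illustration.

```python
def app_rate(slave_rates):
    """Equation (1): sum of the per-slave task completion rates."""
    return sum(slave_rates.values())


def exec_time(tasks, slave_rates):
    """Equation (2): total number of tasks divided by the aggregate rate."""
    rate = app_rate(slave_rates)
    if rate <= 0:
        raise ValueError("aggregate task rate must be positive")
    return tasks / rate


# Hypothetical SlaveRate(m, s) values in tasks per second (not from the paper).
rates = {"A": 20.0, "B": 30.0, "D": 50.0}
total_rate = app_rate(rates)      # aggregate tasks per second
seconds = exec_time(1000, rates)  # time to finish 1000 tasks
```

Because the model is additive, adding a slave can only raise the aggregate rate if the resource constraints discussed next leave room for its flow.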

Application performance can thus be derived from values for SlaveRate(m, s). One way to solve
for these SlaveRate values is to consider the system resource constraints which bound
achievable application performance. To illustrate the concept, we go back to our simple
example system in Figure 1, and observe that each processor and network has been labeled with
one or more numerical values. We define the numbers in the diagram to represent resource work
capacities in terms of tasks per unit of time. The values next to network links represent
network work capacity for that network link, the upper number within each circle represents
slave work capacity for that processor, and the lower number within each circle represents
master work capacity for that processor.

Consider an application which uses processor C to host the master process. To solve for
application performance, we would like to determine values for SlaveRate(C, A),
SlaveRate(C, B) and SlaveRate(C, D). The fundamental constraint condition to meet is that
total task flow rates through any resource cannot exceed the capacity value of that resource.
This means, since task flow from both processor A and processor B passes through network Net3
in our example, that the sum of SlaveRate(C, A) and SlaveRate(C, B) can be at most 50, the
capacity of Net3.

In general, we can define the following resource work capacity terms for processor and
network resources. All terms are rates with units of tasks per unit of time.

$W_{MasterCPU}(i)$ = the maximum master work rate supported by a processor i. This is
determined by processor i's capacity to perform Stage 4 computations for a specified
application.

$W_{SlaveCPU}(i)$ = the maximum slave work rate supported by a processor i. This is
determined by processor i's capacity to perform Stage 2 computations for a specified
application.

$W_{Net}(n)$ = the maximum communication rate supported by a network n. This is determined
by network n's capacity to perform the Stage 1 and Stage 3 communication of a specified
application.

Assuming we have a graph G representing network connectivity (such as the diagram in
Figure 1) that allows us to identify which network resources are shared between different
task flows, and resource work capacity rates for each of the resources in our system, we can
form a set of upper bounds on possible SlaveRate(m, s) values. The process by which the
network connectivity graph G and the work capacity rate terms can be derived for resources in
a Grid environment will be discussed later in Section 3.

First, to aid us in defining our upper bound constraints, we define a helper set constructor
function:

ShareNet(G, S, m, n) takes as input a network connectivity graph G, a set of slave processes
S, a master process m, and a network resource n, and returns the set of slave processes from
S which share the use of network resource n when communicating with m.

For master/slave applications, ShareNet(G, S, m, n) can be easily determined for a network
graph G, master process m, set of slaves S, and network resource n by following the single
path in the graph G from each slave process $s \in S$ to the master process m, recording each
path passing through the resource n.
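
The path-following procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the graph is assumed to be a tree stored as an adjacency dictionary in which processors and networks are both nodes, and the topology shown is a hypothetical one that is consistent with the text (A and B on Net1, Net3 between A/B and C), since the figure itself is not reproduced here.

```python
from collections import deque


def path_between(graph, src, dst):
    """Return the unique path from src to dst in a tree-shaped graph."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for nxt in graph[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    if dst not in parent:
        raise ValueError(f"no path from {src} to {dst}")
    path = []
    node = dst
    while node is not None:  # walk parent links back to src
        path.append(node)
        node = parent[node]
    return path[::-1]


def share_net(graph, slaves, master, net):
    """Slaves in `slaves` whose path to `master` passes through `net`."""
    return {s for s in slaves if net in path_between(graph, s, master)}


# Hypothetical topology: processors A-D, networks Net1-Net3.
graph = {
    "A": ["Net1"], "B": ["Net1"], "C": ["Net2"], "D": ["Net2"],
    "Net1": ["A", "B", "Net3"],
    "Net2": ["C", "D", "Net3"],
    "Net3": ["Net1", "Net2"],
}
sharing_net3 = share_net(graph, {"A", "B", "D"}, "C", "Net3")
```

With master C in this assumed topology, only A and B reach the master through Net3, which matches the constraint discussed earlier for SlaveRate(C, A) and SlaveRate(C, B).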


Now we can give bounds which form constraints on application performance, as shown below.²

$$\sum_{i \in S} \mathrm{SlaveRate}(m, i) \le W_{MasterCPU}(m) \qquad (3)$$

$$\mathrm{SlaveRate}(m, i) \le W_{SlaveCPU}(i) \qquad (4)$$

$$\sum_{i \in \mathrm{ShareNet}(G, S, m, n)} \mathrm{SlaveRate}(m, i) \le W_{Net}(n) \qquad (5)$$
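
A candidate assignment of slave rates can be checked against the three bounds (3), (4) and (5) directly. The sketch below assumes the ShareNet sets have already been computed and are passed in as a dictionary; the function name `satisfies_constraints` and all capacity values except the Net3 capacity of 50 are hypothetical.

```python
def satisfies_constraints(rates, master, w_master, w_slave, w_net, shared):
    """Check candidate SlaveRate values against the three capacity bounds.

    rates:  {slave: SlaveRate(master, slave)} in tasks per unit time
    shared: {network: set of slaves whose path to the master uses it},
            i.e. a precomputed ShareNet(G, S, m, n) for each network n
    """
    # Master bound: the master must keep up with the aggregate flow.
    if sum(rates.values()) > w_master[master]:
        return False
    # Slave bound: each slave is limited by its own compute capacity.
    if any(rate > w_slave[s] for s, rate in rates.items()):
        return False
    # Network bound: slaves sharing a network are limited jointly.
    for net, users in shared.items():
        if sum(rates.get(s, 0.0) for s in users) > w_net[net]:
            return False
    return True


# Master on C. Only the Net3 capacity (50) comes from the text;
# every other number here is an illustrative assumption.
w_master = {"C": 90.0}
w_slave = {"A": 40.0, "B": 60.0, "D": 30.0}
w_net = {"Net1": 150.0, "Net2": 100.0, "Net3": 50.0}
shared = {"Net1": {"A", "B"}, "Net3": {"A", "B"}, "Net2": {"A", "B", "D"}}

feasible = satisfies_constraints(
    {"A": 20.0, "B": 30.0, "D": 30.0}, "C", w_master, w_slave, w_net, shared)
infeasible = satisfies_constraints(
    {"A": 30.0, "B": 30.0, "D": 10.0}, "C", w_master, w_slave, w_net, shared)
```

The second assignment fails only because A and B together would push 60 tasks per unit time through Net3, exceeding its capacity of 50.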

Our goal is to find the values of SlaveRate(m, s) which meet the constraints given above and
which yield the largest value of AppRate(m, S). The solution will correspond to a
configuration which delivers the best achievable application performance.

We can frame the problem of determining values for SlaveRate(m, s) which yield the largest
AppRate(m, S) value as a flow-rate problem where: (1) the SlaveRate(m, s) values are the
flows we wish to solve for, (2) m is the sink for all flows, (3) the set S of slave processes
are the sources for flows, and (4) the flow constraints correspond to the $W_{MasterCPU}(i)$,
$W_{SlaveCPU}(i)$, and $W_{Net}(n)$ work capacities in our target environment.

Because the work flows in a master/slave computation form a tree rooted at the master, and
because we have limited our investigation to considering no more than one process hosted on
each processor, efficient algorithms like the Maximum-Flow algorithm [5] exist for solving
this problem. This approach can be used to solve the flow-rate problem for several candidate
processes m, finding the one which is expected to deliver the maximum work flow, and hence the
best expected application performance. Section 4 describes the implementation of one such
maximum-flow algorithm that can be used to find the largest possible work flow.
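
Because the flows form a tree rooted at the master, the largest aggregate rate can also be evaluated bottom-up: the flow leaving any node is the smaller of that node's capacity and the total flow entering it. The sketch below uses this simplification rather than a general maximum-flow algorithm, and it is not the algorithm of Section 4. The topology and all capacities are hypothetical, except that Net3 is given capacity 50 and B a slave capacity of 60, as stated in the text.

```python
def max_app_rate(graph, master, slaves, w_master, w_slave, w_net):
    """Largest AppRate(master, slaves) on a tree-shaped graph."""
    def flow_from(node, parent):
        if node in slaves:
            return w_slave[node]   # a leaf source, limited by its CPU
        if node not in w_net:
            return 0.0             # a processor not used as a slave
        inflow = sum(flow_from(n, node) for n in graph[node] if n != parent)
        return min(w_net[node], inflow)

    inflow = sum(flow_from(n, master) for n in graph[master])
    return min(w_master[master], inflow)


def best_master(graph, processors, w_master, w_slave, w_net):
    """Try each processor as master, using all the others as slaves."""
    rates = {
        m: max_app_rate(graph, m, processors - {m}, w_master, w_slave, w_net)
        for m in processors
    }
    return max(rates, key=rates.get), rates


# Hypothetical topology and capacities (tasks per second).
graph = {
    "A": ["Net1"], "B": ["Net1"], "C": ["Net2"], "D": ["Net2"],
    "Net1": ["A", "B", "Net3"],
    "Net2": ["C", "D", "Net3"],
    "Net3": ["Net1", "Net2"],
}
w_slave = {"A": 40.0, "B": 60.0, "C": 80.0, "D": 30.0}
w_master = {"A": 120.0, "B": 100.0, "C": 90.0, "D": 70.0}
w_net = {"Net1": 150.0, "Net2": 100.0, "Net3": 50.0}

choice, rates = best_master(graph, {"A", "B", "C", "D"}, w_master, w_slave, w_net)
```

Under these assumed capacities, placing the master on A lets B contribute its full rate while Net3 caps the contribution from the far side of the network, so A yields the highest aggregate rate of the four candidates.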

3. Modeling work capacity rates in a Grid environment

In order to apply our work flow performance model to real applications running in a Grid
environment, we must derive a network connectivity graph G and appropriate values for the
work capacity rate terms $W_{MasterCPU}(i)$, $W_{SlaveCPU}(i)$, and $W_{Net}(n)$.

The flow-rate algorithm for determining application performance requires a graph G which
represents the network connectivity between processor resources. For wide-area Grid
environments, it might be very difficult to get complete physical network configuration data
about every platform in the system. It is reasonable, however, to represent the target
computational resources and their interconnection by a logical view which captures those
areas where network constraints present potential bottlenecks to application performance. We
derive a logical view of resource interconnection using a logical network configuration
discovery tool called Effective Network Views (ENV) [13]. (Other systems for discovery of
effective system topology such as [9] might also be used.) The output of the ENV tool is a
network graph representation where every processor belongs to a cluster of one or more
machines. Machines in a cluster are connected together through a local network, where the
capacity of the local network represents the limiting capacity of a network resource shared
by each machine in the cluster. Clusters of local networks are connected together in our
logical representation through a single layer of non-local network links. This
representation is suitable for use in graph-based analysis techniques like our maximum
flow-rate problem, and directly translates to the network graph G in our flow-rate solution.

² Since we consider here only cases where processors can host at most one process from the
same application, we allow the process identifier to be the identifier of the processor
hosting it in our inequality expressions.

Input Data      Description              Used In                   How Acquired       When Acquired
Graph_Net       Network connectivity     G                         ENV                periodically
T_SlaveCPU      CPU slave task time      W_SlaveCPU                benchmark          install
T_MasterCPU     CPU master task time     W_MasterCPU               benchmark          install
Avail_CPU       CPU availability         W_SlaveCPU, W_MasterCPU   NWS                run-time
Size_TaskXfer   Task data transfer size  W_Net                     analysis, logging  application
BW_Net          Network bandwidth        W_Net                     NWS                run-time

Table 1. Inputs for constructing the performance model.

The processor work capacity rates $W_{SlaveCPU}(i)$ and $W_{MasterCPU}(i)$ in our model are
determined with two components: an application-specific component representing the maximum
performance delivered by a processor resource in its unloaded state, and a dynamic component
that is determined at run-time to adjust capacity rates to account for current loading
conditions. The application-specific component is obtained by running a benchmark of the
target application code on an unloaded processor, and measuring the times $T_{SlaveCPU}(i)$
and $T_{MasterCPU}(i)$ that are required to compute a single task on processor type i by the
slave and master processes respectively. If the task computation time is variable over time,
perhaps because of data dependencies in the application, we take an average value for all
task times in one run of the application benchmark. This value could be scaled for particular
classes of data sets at run-time if the variation in average task run times is large when
different data sets are used. The benchmark times only have to be measured once for each
platform type on which the application is built to run, so obtaining these values is
computationally efficient.

The dynamic component of the work capacity terms for processor resources is calculated with
the help of real-time monitoring and forecasting services such as the Network Weather Service
[19] (NWS). The NWS provides real-time predictions of dynamic processor availability
$Avail_{CPU}(i)$ (the percentage of CPU time a process can expect to get on processor i).
$Avail_{CPU}(i)$ describes the predicted availability status of a processor resource, and can
be generated independently from any particular application. This enables a single NWS system
to provide simultaneous service to many applications requiring real-time information about
resource behavior.

The processor work capacity rates can be calculated using the application-specific and
dynamic components as shown below. The input parameters for these functions are summarized in
Table 1.

$$W_{SlaveCPU}(i) = Avail_{CPU}(i) / T_{SlaveCPU}(i) \qquad (6)$$

$$W_{MasterCPU}(i) = Avail_{CPU}(i) / T_{MasterCPU}(i) \qquad (7)$$

The network work capacity rate $W_{Net}(n)$ in our model is also calculated using two
components. One component is the application-specific term $Size_{TaskXfer}$, which
represents the amount of data transferred between a master process and a slave process for
each task in an application. If the task data transfer sizes are a variable quantity over
time, perhaps due to data dependencies in the application, we must calculate an average data
transfer value that represents expected steady-state communication behavior over the time of
an entire application run.

The second component used in calculating network work capacities on a network resource n is
a dynamic prediction of expected available network bandwidth $BW_{Net}(n)$, which we obtain
from the NWS. The network work capacity rates can be calculated using application-specific
and dynamic components as shown below. The input parameters are again summarized in Table 1.

$$W_{Net}(n) = BW_{Net}(n) / Size_{TaskXfer} \qquad (8)$$
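
The capacity formulas in Equations (6) through (8) reduce to simple ratios. The sketch below is illustrative only: the function names and sample measurements are hypothetical, CPU availability is assumed to be expressed as a fraction between 0 and 1, and bandwidth and transfer size are assumed to share the same data unit (bytes here).

```python
def w_slave_cpu(avail_cpu, t_slave_cpu):
    """Equation (6): slave tasks per second on a processor."""
    return avail_cpu / t_slave_cpu


def w_master_cpu(avail_cpu, t_master_cpu):
    """Equation (7): master tasks per second on a processor."""
    return avail_cpu / t_master_cpu


def w_net(bw_net, size_task_xfer):
    """Equation (8): tasks per second a network can carry."""
    return bw_net / size_task_xfer


# Hypothetical inputs: a half-available CPU, a benchmarked slave task time
# of 0.25 s, a master task time of 0.125 s, 1,000,000 bytes/s of predicted
# bandwidth, and 20,000 bytes transferred per task.
slave_capacity = w_slave_cpu(0.5, 0.25)
master_capacity = w_master_cpu(0.5, 0.125)
net_capacity = w_net(1_000_000, 20_000)
```

Only the availability and bandwidth inputs change at run-time, so the benchmarked task times and transfer size can be stored once and the capacities recomputed cheaply whenever a new forecast arrives.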

Having constructed a set of resource constraint values to help model the performance of a
Grid environment, we should also discuss an obvious limitation of our approach. For each of
the terms derived in this section, we have generated an average-value expression for use in
our steady-state application performance model. Yet each of the properties being modeled
might in reality exhibit considerable variability over time, either due to time-varying load
conditions or data-dependent behavior of the application being run. Our experience has been
that despite the limitations of converting many variable terms to average steady-state
values, our approach still yields a performance model which can do a good job at estimating
application performance, and which provides an effective tool for helping to solve the
resource selection problem, which we discuss next.

4.  Selecting  a  master  and  the  slaves 

Given the work-rate performance model described in Section 2 and a logical representation of
the work capacities of Grid resources, we can now consider strategies for selecting
processors to host the master and slave processes. These are important issues for
master/slave applications running in Grid environments, where users may be able to choose
from among many different types of resources, and where availability of these resources may
change over time.

Selection of the right processor to host the master process can significantly impact the
overall application performance, as the following section will show. Knowing which master
placement produces the best application performance might also influence other important
decisions, such as where to efficiently position input and output files for the application.
Selection of the right set of processor resources to host the slave processes has two goals:
(1) selecting enough resources from the available set to produce the best achievable
application performance, and (2) limiting the selection to resources that will actually
benefit application performance. The second goal is important for Grid environments where
resources can be shared by many users, and resources can be owned and managed by many
different organizations. In these environments, it is desirable that applications use only
those resources they really need, thereby allowing limited pools of shared resources to
satisfy the largest number of users. We will first consider the issue of selecting the right
host for the master process.

4.1.  Master  selection  example 

In  a  heterogeneous  system,  selection  of  a  location  for 
the  master  process  very  strongly  depends  on  the  deliverable 
work capacity of candidate resources. Consider the logical
Grid configuration shown in Figure 1, where four
processors  are  connected  by  a  system  of  three  networks.  We 
have  labeled  the  network  resources  with  values  representing 
the $W_{Net}$ capacity terms. The processor resources, shown
as circles in the diagram, have been labeled with two values:
a $W_{SlaveCPU}$ capacity term on top, and a $W_{MasterCPU}$ capacity
term on the bottom. All capacity terms are in units
of  tasks  per  second. 

For  this  simple  example  system,  we  can  determine  the 
assignment  of  the  master  process  to  a  processor  that  gives 
us  the  greatest  achievable  work  flow.  If  processor  A  is  se¬ 
lected  to  host  the  master  process,  processor  B  is  able  to  pro¬ 
vide  60  tasks/sec  work  rate  as  a  slave.  In  addition,  a  max¬ 
imum  of  50  tasks/sec  worth  of  data  can  be  transferred  over 
network Net3, a work rate which can be supplied by processor
C. The total expected application work rate with processor
A hosting the master is therefore 110 tasks/sec. If we
consider  selecting  processor  C  to  host  the  master  process, 
we  observe  that  processor  D  can  deliver  a  work  rate  of  10 
tasks/sec  working  as  a  slave.  In  addition,  we  can  transfer  a 
maximum  of  50  tasks/sec  worth  of  data  over  network  Net3, 
which can be supplied by processor A. Processor C is thus
limited to an application work rate of no more than 60
tasks/sec, both by the capacity of Net3 and by its own
capacity to serve as the master host. We could proceed
in a similar manner for all the processors, and derive
expected  application  work  rates  for  each  candidate.  Table  2 
shows  one  set  of  possible  outcomes  for  this  process.  It  is 
apparent  from  the  last  column  in  Table  2  that  processor  B 
is  the  best  choice,  yielding  a  potential  application  work  rate 
of  130  tasks/sec. 
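
The arithmetic behind the last column of Table 2 can be checked with a few lines of Python. This is only an illustrative sketch: the numbers are transcribed from Table 2, while the names `TABLE_2`, `app_rate`, `APP_RATES` and `BEST_MASTER` are hypothetical, and treating AppRate(m) as the sum of the slave rates capped by the master capacity is our reading of the table rather than a formula stated in the text.

```python
# Values transcribed from Table 2 (all rates in tasks/sec).
TABLE_2 = {
    "A": {"w_master": 200, "slave_rate": {"A": 0, "B": 60, "C": 50, "D": 0}},
    "B": {"w_master": 150, "slave_rate": {"A": 80, "B": 0, "C": 50, "D": 0}},
    "C": {"w_master": 60, "slave_rate": {"A": 50, "B": 0, "C": 0, "D": 10}},
    "D": {"w_master": 90, "slave_rate": {"A": 40, "B": 0, "C": 50, "D": 0}},
}


def app_rate(row):
    """Sum of slave contributions, capped by the master capacity (assumed reading)."""
    return min(row["w_master"], sum(row["slave_rate"].values()))


# Application work rate for each candidate master, and the best candidate.
APP_RATES = {m: app_rate(row) for m, row in TABLE_2.items()}
BEST_MASTER = max(APP_RATES, key=APP_RATES.get)

if __name__ == "__main__":
    print(APP_RATES, BEST_MASTER)
```

Under this reading, the computed rates agree with the AppRate(m) column, and processor B comes out as the best master.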

Master m  W_MasterCPU(m)  SlaveRate(m,A)  SlaveRate(m,B)  SlaveRate(m,C)  SlaveRate(m,D)  AppRate(m)
A         200             0               60              50              0               110
B         150             80              0               50              0               130
C         60              50              0               0               10              60
D         90              40              0               50              0               90

Table 2. Work rates (tasks/sec) resulting from master placement decision.

4.2. Selecting the master

More generally, we have developed a basic algorithm for
finding the best performing host for the master process. It is
based on the well-known maximum-flow algorithm by Ford
and Fulkerson [7]. In this algorithm, we keep augmenting
the estimated flow rate for each master host by adding the
contributions  of  slave  processors.  Additional  contributing 
slaves  are  selected  first  from  those  on  the  same  local  net¬ 
work  as  the  master.  This  continues  until  either  all  of  the 
slaves  have  been  included,  or  no  further  slave  work  rates 
can  be  incorporated  because  of  either  capacity  limitations 
on  network  resources,  or  capacity  limitations  of  the  master 
processor  itself.  If  further  capacity  is  available  from  pro¬ 
cessors  on  non-local  networks,  they  are  added  one  by  one 
to  the  accumulated  master  total  until  no  further  additions 
are  possible  without  exceeding  one  of  the  resource  capaci¬ 
ties.  Figure  2  illustrates  our  basic  algorithm  for  finding  the 
best  performing  master  host.  Upon  termination  of  the  algo¬ 
rithm,  the  processor  with  the  highest  calculated  work  rate  is 
selected  as  the  master. 
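
The greedy accumulation described above (and given as pseudocode in Figure 2) can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: it assumes every local network is joined to the others through a single shared backbone network, and the names `Processor`, `candidate_rate`, `select_master`, `PROCS` and `W_NET` are hypothetical. The capacities in the example are also hypothetical; they are not read from Figure 1, and were chosen only so that the outcome is consistent with Table 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Processor:
    name: str
    net: str         # name of the processor's local network
    w_slave: float   # W_SlaveCPU, tasks/sec
    w_master: float  # W_MasterCPU, tasks/sec


def candidate_rate(master, procs, w_net, backbone):
    """Greedy work-rate estimate for one candidate master.

    Returns (CandRate, Found), where Found maps slave name -> contributed rate.
    """
    util = {k: 0.0 for k in w_net}  # NetUtil(k)
    rate = 0.0                      # CandRate(p)
    found = {}                      # Found(p)

    def path(q):
        # Assumed topology: local networks joined by one shared backbone.
        if q.net == master.net:
            return [master.net]
        return [master.net, backbone, q.net]

    others = [q for q in procs if q.name != master.name]
    # Local slaves first, then remote ones; largest W_SlaveCPU first in each group.
    for q in sorted(others, key=lambda q: (q.net != master.net, -q.w_slave)):
        net_headroom = min(w_net[k] - util[k] for k in path(q))
        f = min(q.w_slave, net_headroom, master.w_master - rate)
        if f <= 0:
            continue  # a constraint on this slave's path is already saturated
        rate += f
        for k in path(q):
            util[k] += f
        found[q.name] = f
    return rate, found


def select_master(procs, w_net, backbone):
    """Evaluate every processor as master and return the best one."""
    results = {p.name: candidate_rate(p, procs, w_net, backbone) for p in procs}
    best = max(results, key=lambda name: results[name][0])
    return best, results


# Hypothetical capacities (NOT taken from Figure 1), in tasks/sec.
PROCS = [
    Processor("A", "Net1", w_slave=80.0, w_master=200.0),
    Processor("B", "Net1", w_slave=60.0, w_master=150.0),
    Processor("C", "Net2", w_slave=50.0, w_master=60.0),
    Processor("D", "Net2", w_slave=10.0, w_master=90.0),
]
W_NET = {"Net1": 200.0, "Net2": 200.0, "Net3": 50.0}
BEST, RESULTS = select_master(PROCS, W_NET, backbone="Net3")
```

With these assumed capacities the sketch selects processor B, and the per-candidate rates match the AppRate(m) column of Table 2. Each slave is visited once per candidate master, which is in line with the per-candidate cost discussed in the complexity analysis, apart from the sort used here for convenience.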

4.3.  Complexity 

In  deriving  the  complexity  of  our  algorithm,  we  note 
that  our  simplified  logical  representation  of  network  con¬ 
figuration  reduces  the  entire  system  to  sets  of  processors 
connected  by  local  networks.  Each  of  these  local  networks 
is  then  connected  to  other  local  networks  by  at  most  one 
level  of  remote  networking.  With  this  logical  topology, 
data  transfers  between  slaves  on  the  same  local  network 
pass  through  only  one  level  of  networking,  and  encounter 
only  one  network  resource  constraint.  Data  transfers  be¬ 
tween  slaves  located  on  different  local  networks  will  pass 
through  at  most  three  levels  of  networking,  and  must  satisfy 
at  most  three  networking  constraints.  All  slave  work  rates 
must  meet  the  resource  constraints  of  the  master  processor. 
With  this  arrangement,  there  are  at  most  four  tests  of  con¬ 
straints  in  our  algorithm  that  have  to  be  checked  for  each 
master  and  slave  pairing. 

If we have $n$ processors in our system, then each master
candidate can have at most $n - 1$ slaves, and each individual
master work rate calculation takes $O(n)$ time.
Calculating maximum work rates for all $n$ possible master
candidates thus takes $O(n^2)$ time. Since our algorithm
requires  only  simple  compare  and  accumulation  operations 
for  each  resource  constraint  test,  the  entire  algorithm  is  ef¬ 
ficient  for  the  numbers  of  processors  and  networks  we  cur¬ 
rently  find  in  Grid  environments  available  to  a  typical  user. 
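
As a rough worked bound, assuming the figures stated above (at most four constraint tests per master and slave pairing, and at most $n - 1$ slaves for each of the $n$ master candidates), the total number of constraint tests is bounded as follows:

```latex
\[
  \text{tests}
  \;\le\;
  \underbrace{n}_{\text{master candidates}}
  \times
  \underbrace{(n-1)}_{\text{slaves per candidate}}
  \times
  \underbrace{4}_{\text{tests per pairing}}
  \;=\; 4n(n-1)
  \;=\; O(n^2)
\]
```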


4.4.  Selecting  the  slaves 

After  selecting  the  master  processor,  we  turn  to  selec¬ 
tion  of  the  slave  processors.  The  issue  is  to  select  a  set  of 
processors  for  hosting  slave  processes  that  will  deliver  good 
aggregate  performance.  One  approach  is  to  start  with  the  set 
of  slave  processors  found  in  our  master  selection  algorithm 
that  yielded  the  highest  expected  application  performance. 
Our algorithm keeps track of this set in the Found(m) list,
a  list  containing  slaves  used  by  the  algorithm  to  calculate  the 
maximum  work  rate  for  an  application  with  processor  m  as 
the  master  host.  Our  master  selection  algorithm  ensures  that 
this  set  of  processors  results  in  work  flows  that  fall  within 
the  constraints  imposed  by  resource  capacity  limitations. 

In numerous experimental trials using the set of processors
from Found(m) as slave hosts, we observed that the
slave  processors  were  often  not  delivering  the  maximum 
work  rate  values  we  expected  in  our  algorithm.  Observa¬ 
tions  of  selected  slaves  showed  the  reduction  in  slave  per¬ 
formance  was  due  to  the  presence  of  unaccounted  idle  time, 
periods  of  time  when  slave  processors  were  not  doing  use¬ 
ful  work.  An  explanation  for  the  observed  idle  times  comes 
from  observing  the  manner  in  which  tasks  are  distributed  to 
slave  processors  from  the  master.  Each  master/slave  appli¬ 
cation  we  tested  maintained  a  queue  of  available  tasks  on 
the  master  process,  and  distributed  new  tasks  to  individual 
slave  processes  upon  request  (a  very  commonly  used  tech¬ 
nique).  Because  of  contention  for  shared  resources,  such  as 
networks  and  the  master  processor,  delays  sometimes  oc¬ 
curred  between  the  time  a  slave  processor  finished  one  task 
and  the  time  at  which  the  next  task  appeared  for  processing. 
These  delays  appeared  as  idle  time  in  our  observations  of 
the  slaves.  With  a  minimum  set  of  slaves  selected  to  achieve 
the  desired  work  rate,  the  unexpected  idle  time  in  the  slaves 
resulted  in  a  reduction  of  the  actual  total  work  rate  achieved. 

For all networks k
    Calculate maximum network capacity W_Net(k)
For all processors j
    Calculate maximum master processor capacity W_MasterCPU(j)
    Calculate maximum slave processor capacity W_SlaveCPU(j)
For each candidate master processor p on local network n
    Set sum for candidate slave work rates CandRate(p) = 0
    Set found set Found(p) to empty
    For all networks k
        Set network utilization sum NetUtil(k) = 0
    Get maximum capacity W_Net(n) of local network n
    Get maximum master processor capacity W_MasterCPU(p)
    While CandRate(p) < W_Net(n) and CandRate(p) < W_MasterCPU(p)
        Select new processor s from same local network as p with
            the largest available W_SlaveCPU(s) value
        Get slave processor capacity W_SlaveCPU(s)
        Get fraction F of W_SlaveCPU(s) that will not cause
            utilization NetUtil(n) to exceed W_Net(n)
        Add F to CandRate(p)
        Add F to NetUtil(n)
        Add processor s to found set Found(p)
    Total candidate work rate CandRate(p) = min(CandRate(p), W_MasterCPU(p))
    Total local network utilization NetUtil(n) = CandRate(p)
    While CandRate(p) < W_Net(n) and CandRate(p) < W_MasterCPU(p)
        Select new processor q from outside local network with
            the largest available W_SlaveCPU(q) value
        Get slave processor capacity W_SlaveCPU(q)
        Get fraction F of W_SlaveCPU(q) that will not cause
            utilization NetUtil(i) to exceed W_Net(i) for any network i
        Add F to CandRate(p)
        Add F to NetUtil(n)
        Add F to other NetUtil(k) where network k is involved in
            communications between processors p and q
        Add processor q to found set Found(p)
Select processor p with largest CandRate(p) as master

Figure 2. Algorithm for finding best processor for the master.

The work flow-rate performance model correctly determines
possible application performance based on resource
capacity limits. Our master selection algorithm uses this
performance model, and in the process identifies a set of
slaves which delivers this performance, assuming that each
slave delivers its maximum work rate. Experimentation has
shown that sometimes these slaves actually deliver less than
their predicted maximum work rates, resulting in less performance
than resource capacity constraints would allow.
One way to get application performance back up to predicted
levels is to add more slave processors to the
originally  selected  mix,  thereby  raising  the  effective  slave 
work  rates  achieved  up  to  expected  values.  Our  goal  is  to 
compensate  for  lost  performance  due  to  idle  time  on  the  in¬ 
dividual  slave  processors,  while  keeping  the  number  of  ad¬ 
ditional  processors  down  to  the  minimum  needed  to  accom¬ 
plish  this  goal. 

Our  steady-state  flow-rate  performance  model  was  not 
useful  in  helping  to  decide  how  many  slaves  to  add  to  in¬ 
crease  effective  performance  because  it  could  not  account 
for  idle  times  caused  by  slaves  waiting  for  new  tasks  to  ar¬ 
rive.  To  address  this  shortcoming  and  others  in  our  steady- 
state  approaches  to  performance  analysis,  we  developed  a 
master/slave  application  performance  simulator  to  provide 
significant  new  capabilities.  We  discuss  this  simulator  and 
how  it  can  be  used  to  help  solve  the  slave  selection  problem 
in  the  following  subsections. 

4.5.  An  application  performance  simulator 

We  originally  developed  a  master/slave  application  per¬ 
formance  simulator  to  provide  detailed  predictions  of  per¬ 
formance  and  resource  behavior  for  applications  running  in 
Grid  environments.  One  effective  use  we  have  found  for  this 
simulator  is  to  help  determine  how  many  additional  slave 
processors  might  be  added  to  a  predicted  group  of  master 
and  slave  processors  to  make  up  for  performance  losses  due 
to  slave  idle  time. 

At  its  core,  our  simulator  is  a  set  of  routines  which  model 
the  behavior  of  tasks  as  they  pass  through  a  system  com¬ 
prised  of  two  kinds  of  resources:  processors  and  networks. 
The  resources  are  modeled  as  single  servers  with  first-in- 
first-out  input  queues.  Service  times  for  the  processor  re¬ 
sources  determine  how  long  a  task  has  control  of  the  pro¬ 
cessor  before  relinquishing  the  resource  to  the  next  task 
in the input queue, and are dependent on the same processor
availability parameters $Avail_{CPU}$ and estimated
task execution times $T_{SlaveCPU}$ and $T_{MasterCPU}$ developed
earlier for our flow-rate model. Service times for
the network resources determine how long a network resource
is committed to servicing data transfers for each
task, and are dependent on the same network bandwidth
parameters $BW_{Net}(n)$ and data transfer size values
$Size_{TaskXfer}$ developed for the flow-rate model presented
earlier.  In  addition,  all  of  the  parameters  can  be  adjusted  to 
use  either  static  steady-state  values  like  those  in  the  flow- 
rate  performance  model,  or  more  dynamic  data  inputs  such 
as  statistical  distributions  or  actual  measured  trace  values 
from  application  runs.  Network  connectivity  is  represented 
using  the  same  graph  G,  an  output  of  the  ENV  tool,  used  in 
the  work  flow-rate  performance  model. 


The  simulator  is  written  in  highly  portable  C-language 
code,  with  the  help  of  a  simulation  library  package  called 
Sim++ [4]. This simulator can be easily embedded into
other  programs,  such  as  an  application  scheduler,  to  pro¬ 
vide  detailed  predictions  of  application  performance  and 
resource  utilization  levels.  It  is  particularly  useful  for  ob¬ 
serving  the  performance  impact  of  changing  application  or 
resource  parameters. 
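
To make the queueing view concrete, the following Python sketch models a much smaller system than the simulator described above: one master processor and one shared network, each a single server with a first-in-first-out queue, plus a set of slaves that pull a new task whenever they become free. It is not the authors' C-language, Sim++-based simulator. The function name `simulate`, its parameters, and the fixed service times are all assumptions made for illustration, and the return of results to the master is not modeled.

```python
import heapq


def simulate(n_tasks, t_master, t_net, t_slave):
    """Event-driven sketch of a pull-based master/slave run.

    t_master, t_net: fixed service times of the master and the shared network.
    t_slave: list with one per-task compute time per slave.
    Returns (makespan, busy fraction of each slave).
    """
    master_free = 0.0  # time at which the master server next becomes free
    net_free = 0.0     # time at which the network server next becomes free
    busy = [0.0] * len(t_slave)
    makespan = 0.0
    issued = 0
    # Events are (time, sequence number, kind, slave); every slave asks at t=0.
    events = [(0.0, s, "request", s) for s in range(len(t_slave))]
    heapq.heapify(events)
    seq = len(t_slave)
    while events:
        now, _, kind, s = heapq.heappop(events)
        if kind == "request":
            if issued == n_tasks:
                continue  # no tasks left: this slave simply stops
            issued += 1
            # FIFO single server: start when both request and server are ready.
            master_free = max(now, master_free) + t_master
            nxt = (master_free, "transfer")
        elif kind == "transfer":
            net_free = max(now, net_free) + t_net
            nxt = (net_free, "compute")
        else:  # "compute": the task has arrived at an idle slave
            done = now + t_slave[s]
            busy[s] += t_slave[s]
            makespan = max(makespan, done)
            nxt = (done, "request")
        heapq.heappush(events, (nxt[0], seq, nxt[1], s))
        seq += 1
    utilization = [b / makespan if makespan else 0.0 for b in busy]
    return makespan, utilization
```

For example, `simulate(2, 1.0, 1.0, [2.0])` gives a makespan of 8.0 with the single slave busy only half of the time, because the slave waits for the master and the network between tasks. This is the kind of idle time that a steady-state capacity figure of one task every 2 seconds does not capture.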

4.6.  Using  simulation  to  enhance  slave  selection 

Our algorithm for finding the correct set of slave processors
starts with the master processor m and the Found(m)
set  of  slaves  from  the  master  selection  algorithm.  The  sim¬ 
ulator  is  run  with  these  machines  as  the  target  environment, 
using  the  same  values  for  resource  capacities  as  were  used 
in  the  master  selection  algorithm.  Results  from  the  simu¬ 
lation  are  checked  to  see  if  any  idle  time  on  the  simulated 
slaves  results  in  a  significant  decrease  in  overall  applica¬ 
tion  performance.  If  a  substantial  performance  decrease  is 
found,  resource  utilization  figures  from  the  simulation  are 
checked  to  see  where  additional  processors  might  be  added 
without  exceeding  existing  resource  constraints.  If  more 
slave  processors  are  available  to  be  added  that  will  not  vi¬ 
olate  any  known  resource  constraints,  they  are  added  to  the 
set  of  found  slaves.  A  new  system  configuration  with  the 
additional  processors  added  in  is  constructed  and  simulated 
once  again.  The  process  of  slave  additions  and  testing  by 
simulation  repeats  until  either  there  are  no  further  perfor¬ 
mance  gains  realized  by  adding  more  slave  processors,  or  no 
more  processors  can  be  found  and  placed  without  exceed¬ 
ing  one  of  the  known  resource  capacity  constraints.  Figure  3 
illustrates  our  algorithm  for  finding  the  set  of  slave  proces¬ 
sors. 

The  algorithm  given  above  makes  good  use  of  simulator 
results  which  calculate  predicted  resource  utilization  values 
for  every  resource  in  the  system.  These  values  allow  us 
to  quickly  identify  where  in  the  system,  if  anywhere,  slave 
processors  might  be  added  to  improve  application  perfor¬ 
mance.  In  practice,  the  number  of  times  the  simulation  cy¬ 
cle  needs  to  be  run  is  small  as  the  process  quickly  converges 
to  a  situation  where  either  additional  performance  gains  are 
insignificant,  or  no  further  additions  can  be  made  without 
exceeding  a  resource  constraint. 
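
The add-and-resimulate cycle described in this subsection (and listed in Figure 3) can be sketched as a short Python loop. This is a hedged illustration only: `select_slaves`, `simulate` and `find_additional` are hypothetical names, the two callables stand in for the performance simulator and for the check against network and master capacity constraints, and the toy functions at the end exist purely to exercise the loop.

```python
def select_slaves(found, predicted_rate, simulate, find_additional, tol=1e-9):
    """Grow the slave set until the simulated rate stops improving.

    simulate(slaves) -> (work_rate, utilization_by_slave)
    find_additional(slaves, utilization) -> processors that may still be added
        without violating any known resource constraint (may be empty).
    """
    found = list(found)
    rate, util = simulate(found)
    while rate < predicted_rate - tol:
        extra = find_additional(found, util)
        if not extra:
            break  # nothing more can be placed within the constraints
        trial = found + list(extra)
        new_rate, new_util = simulate(trial)
        if new_rate <= rate + tol:
            break  # no further gain: keep the previous slave set
        found, rate, util = trial, new_rate, new_util
    return found, rate


# Toy stand-ins (assumptions for illustration, not measured behavior).
def toy_simulate(slaves):
    # Each slave contributes 30 tasks/sec until a 100 tasks/sec cap is reached.
    return min(100.0, 30.0 * len(slaves)), {s: 0.6 for s in slaves}


def toy_find_additional(found, util, pool=("C", "D", "E", "F")):
    # Offer one unused processor at a time from a fixed pool.
    unused = [p for p in pool if p not in found]
    return unused[:1]


SLAVES, RATE = select_slaves(["B"], 100.0, toy_simulate, toy_find_additional)
```

In the toy run, slaves are added one at a time until the simulated rate reaches the predicted rate, which mirrors the small number of simulation cycles reported above.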

5.  Experimental  results 

In this section, we describe experiments whose goal
is to test the usefulness and accuracy of our work-rate performance
model and application performance simulator, as
well as the performance of our algorithms for selecting master
and slave processors.

Run master selection algorithm to get master processor m, set of slaves
    Found(m), and predicted application work rate R
Run application performance simulator using m and Found(m) to get
    simulated work rate S and slave utilization values U(s)
While S less than R
    Using U(s), check which slaves s in Found(m) have large
        simulated idle times
    Find additional processors A' that make up for idle time without
        exceeding any W_Net(n) or W_MasterCPU(m) constraints
    Add processors A' to Found(m) to form Found'(m)
    Run simulator using m and Found'(m) processors to get new
        simulated work rate S' and slave utilization values U'(s)
    If S = S' or Found(m) = Found'(m)
        Return Found(m) as slave solution
    Set S equal to S', all U(s) equal to U'(s)
    Set Found(m) equal to Found'(m)
Return Found(m) as slave solution

Figure 3. Algorithm for finding best processors for slaves.


We  use  as  an  application  test  suite  three  applications  cho¬ 
sen  to  represent  a  spectrum  of  potential  master/slave  dis¬ 
tributed  applications.  The  applications  were  selected  and 
implemented  to  test  the  sensitivity  of  our  approach  to  com¬ 
putation  and  communication  granularity.  Our  master/slave 
implementation  of  the  Mandelbrot  image  application  is  ex¬ 
pected  to  display  a  relatively  high  sensitivity  to  communi¬ 
cation  constraints,  as  the  amount  of  image  data  transferred 
during  execution  is  large  compared  to  the  overall  computa¬ 
tion  time.  At  the  other  extreme  is  the  NAS  Parallel  Bench¬ 
marks’  EP  [18]  application,  which  performs  relatively  little 
data transfer compared to the time spent computing. The
Povray [11] ray-tracing application falls somewhere in the
middle: it transfers one fourth as much image
data as the Mandelbrot application, spread out
over a longer computation time. Each of the applications
was  initially  benchmarked  on  all  target  processor  types  to 
produce  the  application-specific  parameters  needed  for  our 
performance  analysis  tools.  The  applications  are  summa¬ 
rized  in  Table  3. 


5.1.  Experimental  design 

In  the  experiments,  we  compared  predicted  execution 
time  (resulting  from  our  performance  model),  simulated 
execution  time  (using  the  application  simulator),  and  ac¬ 
tual  execution  time  (determined  from  experimental  runs). 
All  comparisons  were  made  in  a  non-dedicated  environ¬ 
ment  where  the  load  traces  used  for  the  predicted  and  simu¬ 
lated  execution  times  were  determined  from  the  NWS  load 
trace  of  the  actual  execution  time  runs.  We  used  identical 
parameter  inputs  for  network  configuration,  resource  con¬ 
straints,  and  application  characteristics  in  both  work-flow 
analysis  and  performance  simulation  tools.  In  this  way,  we 
attempted  to  compare  each  set  of  execution  times  under  the 
same  environmental  conditions. 

The  target  experimental  platform  was  a  heterogeneous 
mix  of  Intel  processor-based  machines  running  Linux,  and 
Sun  SPARC  machines  running  Solaris  located  in  the  Paral¬ 
lel  Computation  Laboratory  in  the  Department  of  Computer 
Science  and  Engineering  at  the  University  of  California, 
San Diego. The experiments were run with all machines
in non-dedicated mode, but outside loading from competing
jobs was observed to be relatively light for most of the
machines during the course of experimentation.

Name        Description                                    Emphasis
Mandelbrot  parallel fractal image generator               communication
Povray      parallel implementation of popular ray-tracer  both
NPB EP      NAS Parallel Benchmark EP variant              computation

Table 3. List of applications used in experiments.

5.2.  Results 

In  the  first  set  of  experiments,  we  ran  the  test  suite  of 
applications  on  a  set  of  nine  workstations  shown  in  Ta¬ 
ble  4.  For  the  three  applications,  trials  were  run  with  each 
of  the  nine  processors  being  selected  to  run  as  the  master 
while  the  other  eight  were  included  to  run  as  slaves.  In 
all  cases,  the  work  flow-rate  problem  was  solved  for  each 
configuration  of  master  and  slaves  to  give  the  expected  ap¬ 
plication  execution  time,  shown  as  the  light  bars  in  Fig¬ 
ures  4-6.  The  application  performance  simulator  was  run 
for  all  cases  to  give  a  predicted  application  execution  time, 
shown by the middle bars in the graphs. Finally, the
real applications were run on each configuration and their
execution times recorded; these appear as the dark bars on the
graphs. Figure 4 shows the results for the relatively
communication-heavy Mandelbrot application.
Figure 5 shows the same set of execution times for the more
balanced  Povray  application,  while  Figure  6  shows  exe¬ 
cution  times  for  the  computation-intensive  NAS  Parallel 
Benchmarks’  EP  application. 

In  these  experiments,  the  work-rate  performance  model 
would  have  done  a  good  job  of  identifying  the  correct  mas¬ 
ter  host  to  produce  the  fastest  application  execution  times. 
In the Mandelbrot series of experiments, the machine thing1
was calculated to yield the lowest execution time, which
was confirmed in the actual application run. For this application,
the highest execution time, obtained with the machine
named lorax, took 170% longer to finish than the best
choice. For the other two applications, the work-rate performance
model estimates of execution time again showed
results which correlated closely with actual application run
times. For these applications, which exhibited lower dependence
on network constraints, the difference between
the worst and best performers was smaller: about 25% for
Povray  and  10%  for  NAS  EP.  The  work-rate  based  perfor¬ 
mance  model  correctly  ordered  master  performance  for  both 
communication  and  computation  constrained  applications. 
The  results  also  show  that  the  application  performance  sim¬ 
ulator  did  a  good  job  of  tracking  the  actual  application  exe¬ 
cution  times  as  well. 

The experimental results show a small number of cases
where the execution time was significantly underestimated
for the Mandelbrot application. Analysis of experimental
results leads us to believe the discrepancy between predicted
and actual performance on the communication-heavy
application was due to inadequate benchmarking of the
$W_{MasterCPU}$ constraint terms. Actual application performance
is worse than that predicted by both the work-flow
model  and  the  simulator  because  both  tools  overestimated 
the  capacity  of  the  single  master  process  to  process  in¬ 
coming  data  and  respond  to  new  task  requests.  When  the 
real  master  process  fails  to  keep  up  with  projected  work 
rates,  the  overall  application  work  rate  is  reduced  and  ex¬ 
ecution  time  becomes  relatively  larger.  Improved  methods 
for  benchmarking  master  processor  performance  are  cur¬ 
rently  being  developed  to  overcome  this  shortcoming. 

In  the  second  set  of  experiments,  we  look  at  two  of  our 
applications:  Mandelbrot  and  Povray.  In  these  trials  we  pick 
a  specific  host  for  the  master  process,  then  run  our  appli¬ 
cation  for  different  numbers  of  slave  processes.  We  show 
measured  execution  times  and  simulated  execution  times 
for  our  two  applications  as  we  increase  the  number  of  slave 
processors. 

Figure  7  shows  results  with  our  communication¬ 
intensive  Mandelbrot  application  for  two  different  choices 
of  the  master  host.  These  results  show  that  the  number 
of  slaves  which  can  beneficially  be  employed  varies  un¬ 
der  different  conditions,  and  is  heavily  constrained  by  the 
network  speed  of  the  master  process  host.  Figure  8  shows 
results  with  the  Povray  application,  whose  performance  is 
less  dominated  by  communication  costs.  In  our  test  envi¬ 
ronment,  this  application  shows  more  scalable  performance 
than  Mandelbrot,  but  eventually  also  reaches  a  point  where 
additional  processors  do  not  significantly  decrease  execu¬ 
tion  time.  Results  are  shown  for  only  one  master  case  be¬ 
cause  data  for  other  cases  produces  almost  identical  graphs. 
Results  for  our  third  application,  NPB  EP,  are  not  shown 
here, but they are very similar to those for Povray, with
simulation-predicted run times and actual application run times
very  close  for  all  numbers  of  processors.  These  results  indi¬ 
cate  that  for  our  representative  examples,  the  performance 
simulator  can  be  a  useful  tool  to  help  predict  the  points  at 
which  either  additional  slaves  should  be  added  to  a  com¬ 
putation  to  increase  performance,  or  when  additional  slaves 
cease to have any useful effect.

Figure 6. Execution time of computation-intensive application while varying master host. (Plot title: Master Selection Results; horizontal axis: Hosts.)


Name         Processor                  Network              OS
azulejo      Intel Pentium Pro 200      100 Mbit/s ethernet  Linux 2.0.36
kingkong     Sun UltraSPARC-IIi 333MHz  100 Mbit/s ethernet  Solaris 2.6
kongo        Sun UltraSPARC 166MHz      100 Mbit/s ethernet  Solaris 2.6
lorax        Sun microSPARC II 85MHz    100 Mbit/s ethernet  Solaris 2.6
magie        Intel Pentium Pro 200      10 Mbit/s ethernet   Linux 2.1.125
saltimbanco  Intel Pentium II-400       10 Mbit/s ethernet   Linux 2.1.125
sojourner    Intel Pentium II-266       10 Mbit/s ethernet   Linux 2.2.9
tandem       Intel Pentium II-300       100 Mbit/s ethernet  Linux 2.0.36
thing1       Sun UltraSPARC 200MHz      100 Mbit/s ethernet  Solaris 2.6

Table 4. Partial list of heterogeneous mix of machines used in experiments.


6.  Related  Work 

Many  different  approaches  to  predicting  the  performance 
of  parallel  applications  on  distributed-memory  machines 
have  appeared  in  the  literature.  A  partial  summary  of  some 
earlier  efforts  can  be  found  in  [10].  Unfortunately,  these 
approaches  often  suffered  from  either  limited  accuracy  un¬ 
der  real-world  conditions  (caused  by  making  many  simpli¬ 
fying  assumptions),  or  from  excessive  complexity  when  ei¬ 
ther  constructing  or  using  the  models.  Our  approach  to  per¬ 
formance  prediction  focuses  on  achieving  useful  levels  of 
prediction  accuracy  while  limiting  model  complexity  and 
allowing  efficient  measurement  and  quantification  of  impor¬ 
tant  model  parameters. 

The application of performance prediction to the problem
of resource selection has also been addressed recently
by  Weissman  and  Zhao  [17].  In  their  work,  Weissman  and 
Zhao  use  heuristics  to  select  a  number  of  candidate  config¬ 
urations,  then  employ  cost  functions  to  derive  computation 
and  communication  times  for  each  configuration.  They  then 
select  the  configuration  yielding  the  lowest  total  cost.  Our 
approach  to  resource  selection  efficiently  evaluates  appli¬ 
cation  performance  for  different  configurations  using  only 
simple  constraint  calculations. 

Figure 7. Application performance with varying numbers of slaves. (Plot title: Slave Selection.)

Figure 8. Application performance with varying numbers of slaves. (Plot title: Slave Selection.)

Subhlok, Lieu and Lowekamp [15] have looked at automatically
selecting processor nodes for applications running
on high-speed networks. For their results, Subhlok,
Lieu and Lowekamp present algorithms which allow them
to automatically select nodes with three different goals:
maximizing computation capacity, maximizing communication
capacity, or balancing computation and communication.
Their paper does not explain how the correct goal is
selected to match specific application characteristics in order
to give optimum performance. Our approach automatically
determines performance bottlenecks based on both
computation and communication constraints, and finds the
best performing configuration for all cases.

7.  Summary 

In  this  paper,  we  have  described  a  rate-based  perfor¬ 
mance  model  for  master/slave  applications  running  on  dis¬ 
tributed  heterogeneous  processors  and  networks.  By  param¬ 
eterizing  this  steady-state  performance  model  with  some 
dynamic  run-time  information,  we  are  able  to  accurately 
predict maximum achievable application performance rates,
even in cases where application characteristics and resource
behavior are not steady over time.

We  have  also  described  an  application  performance  sim¬ 
ulator  which  accurately  simulates  the  dynamic  interaction 
of  a  master/slave  application  with  a  defined  configuration 
of  performance  constrained  resources.  This  simulator  al¬ 
lows  for  a  detailed  analysis  of  where  performance  bottle¬ 
necks  due  to  resource  limitations  may  occur  in  an  applica¬ 
tion.  This  kind  of  detailed  information  about  how  applica¬ 
tions  interact  with  resources  in  a  Grid  environment  can  be 
very  valuable  for  resource  selection  at  application  runtime, 
advanced application and platform planning, and program
development activities. The key to our success with our
performance  prediction  tools  has  been  the  identification  of 
a  common  set  of  application  and  resource  parameters  which 
could  be  quantified  and  measured,  and  which  captured  both 
the  static  and  dynamic  aspects  of  application  performance 
in  Grid  environments. 

Based on the effectiveness of our performance prediction tools,
we have developed algorithms for master and slave resource
selection on Grid platforms. These algorithms enable the
selection of a master processor and a set of slave processors
which allow maximum application performance to occur. Actually
achieving the maximum application performance in dynamic Grid
environments may also require the use of other run-time
techniques to handle issues like load balancing and fault
tolerance. These are issues we are actively researching, and
will be the subject of future publications.

Some brief experimental data was presented to verify that both
our performance prediction tools and our strategies for
selecting master and slave resources were sound. We are
currently integrating the performance tools and resource
selection strategies into an AppLeS [2] Grid application
scheduler with the goal of providing an automatic mechanism for
high-quality distributed master/slave scheduling in
heterogeneous and dynamic Grid environments.

In the future, we would like to extend the work-rate-based
performance model to other common classes of parallel computing
in Grid environments. We would also like to study whether other
physical resource characteristics, such as available memory,
might be beneficial to include in our constraint analyses. Our
experience has shown that the idea of estimating application
performance by accounting for application/resource constraints
appears promising as a tool for enabling more effective
application scheduling.

Abstract

In this work we present the results of a project aimed at
assembling a hybrid massively parallel machine, the PQE1
prototype, devoted to the simulation of complex physical
models. The analysis of some of the existing parallel
architectures has revealed that general-purpose machines are
largely over-dimensioned and often perform inefficiently in
grand-challenge scientific applications. We have thus developed
a heterogeneous parallel system which matches task
heterogeneity with architecture heterogeneity: special-purpose
massively parallel architectures, when coupled to
general-purpose machines, are able to efficiently satisfy the
requirements of complex scientific computing. We present the HW
structure and the SW tools developed for the PQE1 prototype.
Starting from the concepts of machine granularity and task
granularity, we show the necessity of exploiting both
high-granularity and low-granularity parallelism to use the
PQE1 system efficiently. Some application fields in which the
PQE1 prototype has been successfully used are briefly
described.


1.  Introduction 

Technical applications (image processing, real-time control,
...) and the simulation of complex models used in scientific
applications (quantum chemistry, weather forecasting,
electromagnetic compatibility, ...) require sustained
computational power of the order of tens (or hundreds) of
Gflops (1 Gflops = $10^9$ floating point operations per
second). Massively parallel processing seems to be the only
practical way to reach these figures. To date, commodity
off-the-shelf processors are able to provide peak performance
in the range of 1 to 2 Gflops (for example, the 667 MHz Alpha
21264 chip has a peak performance of 1.3 Gflops [1]): hundreds
of those processors can be coupled, for instance, to reach the
desired sustained performance.

The Accelerated Strategic Computing Initiative (ASCI, [2]) and
the Path Forward project, aimed at building very powerful
parallel machines to run extremely complex simulations, have
produced, to date, the installation of several general-purpose
platforms:

1. ASCI Red: composed of 9,216 Pentium Pro processors, it has
584.5 Gbytes of RAM, a bi-directional cross-section bandwidth
of 51.6 Gbyte/sec and a peak performance of 1.8 Tflops;

2. ASCI Blue Mountain: assembled from 48 Silicon Graphics/Cray
Origin2000 servers (each configured with 128 SMP processors)
containing a total of 6,144 processors, with a projected peak
performance of 3 Tflops;

3. ASCI Blue Pacific: has 1,344 PowerPC 604e processors
(332 MHz), 504 Gbytes of RAM, nodes connected through an Omega
network with a node-to-node bandwidth of 150 MB/sec, and offers
a peak performance of 0.89 Tflops.

These platforms too, like the most widespread commercial
parallel systems, are based on commercial general-purpose
computing devices which can sustain very irregular programming
models. If, on the one hand, this property makes these systems
well suited to most of the computational tasks related to
complex scientific applications, on the other hand it can also
be considered their main limitation. In fact, the need to be
general-purpose implies that these systems are designed to
support multitasking/multiuser operating environments, so that
most of their silicon, instead of being devoted to implementing
computing devices, is used to build cache memory and control HW
to manage complex memory hierarchies, out-of-order execution of
instructions, processor scheduling and multiprocess
environments. This fact, while enhancing system operability,
largely decreases the efficiency per silicon area in
floating-point-dominated applications, since a large part of
the electronic devices is not operative for most of the time
[3].

4 This work has been performed in the framework of an
industrial collaboration between ENEA (The Italian Agency for
New Technologies, Energy and the Environment) and QSW (Quadrics
Supercomputing World Ltd., a Finmeccanica group company).

A completely opposite approach to high performance scientific
computing can be found in the physics community, where small
research groups are accustomed to designing their own dedicated
machines which can efficiently solve their computational
problems. An example of this approach is given by the GRAPE
(GRAvitational PipE) system [4]: GRAPE-4, the system version
now available, is a special purpose computer for astrophysical
simulations (N-body gravitational problems requiring $O(N^2)$
computations) with peak speed exceeding 1 Tflops [5]. The GRAPE
system is a completely non-programmable machine, allowing only
the loading of initial data into, and the reading of final data
from, the machine. Extreme specialization is the key to
achieving very high efficiency in the use of the silicon area:
in such platforms only the required functions are implemented,
thus maximizing the performance per unit of volume of
electronics. The GRAPE project is going to release a 200 Tflops
computer [6], yielding a computational speed from 10 to 100
times larger than that achievable, on the same problem, on the
platforms developed in the framework of the ASCI project.

A further example of the advantages which can be achieved
through HW specialization is given by the APE project [7],
launched by Italian physicists and aimed at building a
massively parallel system to be used in Lattice Quantum
Chromodynamics (LQCD). These platforms, the APE series (APE100
is the old system [8], APEmille is the new prototype which will
soon be launched [9], [10]), are SIMD programmable systems
equipped with up to 2048 Very Long Instruction Word (VLIW)
custom processors and offering peak performance of 100 Gflops
(APE100 series) and 1 Tflops (the new APEmille system). In both
cases, the machines in the largest configurations are easily
contained in a few rack-mounted containers.

In scientific computations, most of the time is usually spent
in the execution of quite regular codes which iterate (e.g. in
time, frequency, space) several transformations on large
domains of data. In such a computational scenario,
heterogeneous computing is a very promising way to achieve high
performance: the key idea is to connect a (small)
general-purpose parallel machine to several, very powerful,
specialized parallel systems. The less flexible, specialized
machines are dedicated to providing most of the computational
power required by the numerical programs, while the
general-purpose machine is used to give the necessary
flexibility to the whole system, coordinating tasks and
pre/post-processing the data produced by the specialized
systems.

Heterogeneous  computing  has  been  used  to  achieve  the 
very  high  performances  required  when  dealing  with 
challenging  problems:  machine  heterogeneity  is  exploited 
to  match  task  heterogeneity,  using  massively  parallel 
systems  as  dedicated,  high-efficiency  boosters  attached  to 
a  single  user  general-purpose  parallel  machine. 

In this work we present the outcome of a scientific program
aimed at developing a massively parallel hybrid machine. In the
first part of the paper a theoretical framework to describe
heterogeneous tasks and heterogeneous systems is presented.
Task and machine granularity are introduced and their influence
on the efficient implementation of heterogeneous tasks onto
heterogeneous systems is discussed. Then we describe the PQE1
prototype, the massively parallel hybrid system which has been
developed in our research center. Along with the description of
the HW and the SW of the system, we discuss the rationale of
such an architecture and we sketch the results obtained in two
different, successful applications of the PQE1 platform.
Finally the next version of this hybrid prototype, now in its
final design phase, is presented.


2.  Hierarchical  Modeling  of  Heterogeneous 
Tasks  and  Systems 


An algorithm to be implemented on a parallel system can be
represented as a labeled Control Data Flow Graph (CDFG)
$G(N, E, C\_N, C\_E)$, being

1. $N = \{ n_i \mid i = 1, 2, \ldots, N \}$
the set of functionality necessary to implement the algorithm,

2. $E \subseteq N \times N = \{ e_{ij} = (n_i, n_j) \mid n_i \text{ sends data to } n_j \}$
the internode communication set,

3. $C\_N = \{ c\_n_i \mid c\_n_i \in \mathbb{N},\ n_i \in N,\ i = 1, 2, \ldots, N \}$
the node labeling set, containing the integer value which is a
measure of the complexity of the functionality corresponding to
$n_i$ (e.g., the number of operations needed to implement
$n_i$) and

4. $C\_E = \{ c\_e_{ij} \mid c\_e_{ij} \in \mathbb{N},\ e_{ij} \in E \}$
the channel labeling set, containing the integer value which is
a measure of the complexity of the communication corresponding
to $e_{ij}$ (e.g., the number of bytes sent through $e_{ij}$).

Figure 1: graph node, with input and output edges, representing
the computation $n_i$. [Node labeled $(n_i, c\_n_i)$ with inputs
$Data\_in_1, \ldots, Data\_in_{N\_in_i}$ and outputs
$Data\_out_1, \ldots, Data\_out_{N\_out_i}$, i.e.
$f_i(Data\_in_1, \ldots, Data\_in_{N\_in_i}) \Leftrightarrow (Data\_out_1, \ldots, Data\_out_{N\_out_i})$.]

Each node $n_i$ in the computation is associated with a
functionality which transforms $N\_in_i$ input data (with their
corresponding data types) into $N\_out_i$ output data (with
their data types). $N\_in_i$ and $N\_out_i$ are, respectively,
the input and the output degrees of $n_i$. The correspondence
between node $n_i$ and function $f_i$ is depicted in figure 1.
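
As a minimal illustration of the definition above, the labeled graph G(N, E, C_N, C_E) can be held in a small Python structure. The class name `CDFG`, its field and method names, and the three-node example are assumptions made for this sketch; they do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class CDFG:
    """Labeled control data flow graph G(N, E, C_N, C_E) (illustrative sketch)."""
    node_cost: dict = field(default_factory=dict)  # C_N: node name -> operation count
    edge_cost: dict = field(default_factory=dict)  # C_E: (src, dst) -> bytes sent

    def add_node(self, name, ops):
        self.node_cost[name] = ops

    def add_edge(self, src, dst, nbytes):
        # An edge e_ij may only connect nodes that already belong to N.
        if src not in self.node_cost or dst not in self.node_cost:
            raise KeyError("both endpoints must be nodes of the graph")
        self.edge_cost[(src, dst)] = nbytes

    def in_degree(self, name):
        # Number of incoming edges of the node (its input degree).
        return sum(1 for (_, dst) in self.edge_cost if dst == name)

    def out_degree(self, name):
        # Number of outgoing edges of the node (its output degree).
        return sum(1 for (src, _) in self.edge_cost if src == name)


# Hypothetical three-node task: read -> solve -> write.
g = CDFG()
g.add_node("read", ops=1_000)
g.add_node("solve", ops=5_000_000)
g.add_node("write", ops=1_000)
g.add_edge("read", "solve", nbytes=80_000)
g.add_edge("solve", "write", nbytes=80_000)
```

Storing the two labeling sets as dictionaries keyed by node and by edge keeps N and E implicit in the keys, which is enough for the granularity calculations discussed later.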

In a completely similar way, a parallel system can be
represented through a labeled graph $PS(R, IN, M\_R, B\_IN)$,
being

1. $R = \{ r_i \mid i = 1, 2, \ldots, r \}$
the resource set (processing elements with their local memory,
shared memory banks, I/O devices) which can be decomposed into
basic sets, i.e.

$R = \{ \bigcup_{i=1,\ldots,k} P_i \} \cup \{ \bigcup_{i=1,\ldots,m} M_i \} \cup \{ \bigcup_{i=1,\ldots,t} I/O_i \}$.

So the parallel system resource set is constituted by

- $k$ sets of processing elements $P_i$, each set $P_i$ being
characterized by the number and by the type of homogeneous
computing devices contained in it;

- $m$ sets of memory banks $M_i$, each set of memory banks being
characterized by the number of memory banks, by their access
time and by the size of each bank given (in bytes) by
sizeof($m_j$), $m_j \in M_i$;

- $t$ sets of I/O channels $I/O_i$, each set being characterized
by the directionality, the bandwidth and the number of channels
contained in it.

2. $IN \subseteq Pow(R) \times Pow(R) = \{ c_{ij} = (\{ r_{i1}, \ldots, r_{ih} \}, \{ r_{j1}, \ldots, r_{jn} \}) \mid \{ r_{i1}, \ldots, r_{ih} \} \text{ is connected to } \{ r_{j1}, \ldots, r_{jn} \} \}$
the interconnection network set, where $Pow(R)$ denotes the
power set of $R$. $Pow(R)$ is used to model shared
interconnections: a set of homogeneous processors
$P_i = \{ p, p, \ldots, p \}$ sharing a memory bank $m \in M_i$
is represented through the couple $(P_i, m)$; a shared bus
connecting the processors of $P_i$ is represented by the couple
$(P_i, P_i)$; a point-to-point connection with one dedicated
channel between two non-homogeneous processors is represented
by the couple $(p_a \in P_i, p_b \in P_j)$;

3. $M\_R = \{ m\_r_i \mid m\_r_i \in \mathbb{R},\ r_i \in R,\ i = 1, 2, \ldots, r \}$
the resource labeling set, which associates to each resource a
number measuring its performance (e.g. the number of flops
executed per clock cycle by a general purpose processor
$p \in P_i$, the number of clock cycles necessary to compute
functionality $f$ in the computing devices dedicated to its HW
implementation, the access time for shared memory banks
$m \in M_i$, the bandwidth for I/O channels
$c\_i/o \in I/O_i$);

4. $B\_IN = \{ b\_c_{ij} \mid b\_c_{ij} \in \mathbb{R},\ c_{ij} \in IN \}$
the labeling set which associates the bandwidth to each channel
$c_{ij} \in IN$.

It is fundamental to underline that, for both task and parallel
system graphs, each node can itself be modeled through another
task or parallel system graph: such a hierarchical description
of a graph makes it possible to expose only the degree of
parallelism (and of detail) which the user wants to consider.
All the lower level details are hidden at this stage of
abstraction. For instance, a complex program can be represented
through a CDFG in which nodes are very complex routines; after
a refinement step, each routine can be detailed through several
simpler routines (for instance, an iterative solver can be
expressed by means of Basic Linear Algebra Subroutines (BLAS));
zooming in further, each BLAS routine can be decomposed into
(dependent, i.e. interconnected) elementary operations
expressed in a standard imperative language (e.g., C or
Fortran). As an example of the hierarchical representation of a
parallel system, we can think of a system graph whose nodes are
large systems (Vector Computers, Distributed Memory SIMD and
MIMD systems, Shared Memory Multiprocessors, DSPs and
specialized computing devices) connected through some kind of
(possibly non-homogeneous) IN. Each node can be detailed
through several lower level nodes (processors of the system and
their IN) which can in turn be detailed through a lower level
representation (interconnected functional units within a
processor). A sketch of this hierarchical description is given
in figure 2.


3.  Task  and  Machine  Granularity:  Formal 
Definition  of  Heterogeneous  Systems  and 
Heterogeneous  Tasks 

Having introduced the formal hierarchical definitions used to
model computations and parallel systems, we now try to give a
(non-exhaustive) definition of task and system heterogeneity.
We first need to introduce the fundamental concepts of task and
machine granularity.

The granularity of a task is usually referred to as being
proportional to the ratio between the computation and the
communication times involved in the execution of the task [23].
This definition of granularity is, indeed, machine dependent,
as both communication and computation times may vary when the
task is executed on different architectures. Since we are
interested in a heterogeneous environment, we prefer to
introduce the concepts of machine granularity ($g_m$) and task
granularity ($g_t$). $g_m$ is a measure of the balance between
the computational and communication speed of a system and is
defined as


$$g_m = \frac{\text{node peak computation speed}}{\text{I/O bandwidth}} = \frac{PCS}{BW} \qquad (1)$$

where the node peak computation speed (PCS) is the maximal
number of operations per second executed by the node (PCS is
usually expressed in flops in the context of numerical
computations). The previous definition can be applied to nodes
at different hierarchical levels. Referring to figure 2, for
instance, we can define the granularity of vector nodes, SIMD
nodes and MIMD nodes; at this level (the system level) nodes
usually have very high granularity $g_m$, with typical
computation speeds ranging from a few Gflops to several tens of
Gflops and typical I/O bandwidths from tens of Mbyte/sec up to
a few Gbyte/sec (for massively parallel systems with fast
I/O); a typical value for a medium-large system can be

$$g_m = \frac{50 \times 10^9}{500 \times 10^6} = 100.$$

When moving to a lower level of detail (the sub-system level),
the granularity of a node diminishes, as typical computation
speeds of today's processing elements are in the range of a few
hundred Mflops up to 1 to 2 Gflops and communication bandwidths
range from tens up to a few hundred Mbyte/sec; a typical value
for a high-end processing element (like the Alpha EV6.7)
equipped with a 64-bit PCI connection is

$$g_m = \frac{1.3 \times 10^9}{200 \times 10^6} = 6.5.$$

Moving to a still lower level of detail (the processor level,
inside the processing element), granularity assumes a smaller
value, as computation speed is still in the range of a few
hundred Mflops up to a few Gflops, while communication
bandwidth (processor to memory) ranges from a few hundred
Mbyte/sec up to a few Gbyte/sec; for instance, a processing
element with an EV6.7 processor (peak speed 1.3 Gflops) and a
fast chipset for memory control (e.g. the Tsunami chipset,
allowing an internal memory bandwidth of 2.6 Gbyte/sec) is
characterized by a granularity

$$g_m = \frac{1.3 \times 10^9}{2.6 \times 10^9} = 0.5.$$
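
The three values of machine granularity quoted above follow directly from equation (1). The short Python sketch below recomputes them; the function name `machine_granularity` and the dictionary keys are illustrative assumptions, while the PCS and bandwidth figures are the ones given in the text.

```python
def machine_granularity(pcs_ops_per_s, bw_bytes_per_s):
    """g_m = PCS / BW, as in equation (1)."""
    return pcs_ops_per_s / bw_bytes_per_s


# (peak computation speed in flops, bandwidth in byte/sec) per hierarchical level,
# using the example figures quoted in the text.
levels = {
    "system": (50e9, 500e6),        # medium-large system
    "sub-system": (1.3e9, 200e6),   # EV6.7 element on a 64-bit PCI connection
    "processor": (1.3e9, 2.6e9),    # EV6.7 processor with the Tsunami chipset
}

g_m = {name: machine_granularity(pcs, bw) for name, (pcs, bw) in levels.items()}

for name, value in g_m.items():
    print(f"{name:>10}: g_m = {value:g}")
```

The granularity drops by more than two orders of magnitude from the system level to the processor level, which is the behavior the text describes.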

We are now able to give the following definition of
heterogeneous system: a parallel system is heterogeneous when

1. it is composed of more than one computing element and

2. its computing elements are based on different architectural
paradigms (Vector systems, Distributed Memory/Shared Memory
MIMD systems, SIMD systems, etc.) and/or

3. it can be described through a hierarchical classification
showing different node granularities throughout the
hierarchical levels.

$g_m$ is a measure of the ratio between system computation
speed and system communication bandwidth. Following a similar
reasoning, task granularity $g_t$ is a measure of the balance
between the computational and communication requirements of a
task and is defined as

$$g_t = \frac{\text{number of computing op.}}{\text{number of bytes of I/O data}} = \frac{n\_op}{n\_I/O\_byte} \qquad (2)$$

The hierarchical classification approach, used to model
heterogeneous tasks, can be applied also in the case of CDFGs.
Given a CDFG with $k$ different nodes, $g_t(n_i)$ is the
granularity of each node ($i = 1, 2, \ldots, k$; $n_i \in N$)
and the granularity of the whole CDFG is the maximal value of
the granularity of its composing tasks, i.e.

$$g_t(CDFG) = \max_{i=1,\ldots,k} \left( g_t(n_i) \right) \qquad (3)$$

The granularity of a set of nodes is defined as the largest
granularity in the set because it seems reasonable to represent
the computation/communication demands of a complex task through
its largest component; in fact, given for instance a CDFG with
9 different nodes with the same (small) granularity 1 and one
node with (large) granularity 100, the
computation/communication demands are well represented by the
value 100 (worst case). If we used an average value to
represent the global task granularity, in the previous example
we would obtain $g_t = 10.9$, which clearly underestimates the
influence of the 'large' task, probably yielding, as we will
discuss later, an inefficient implementation of the task on the
parallel system.
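
The comparison between the maximum and the average can be checked with a few lines of Python. This is only a sketch of equations (2) and (3); the function names are assumptions introduced here.

```python
from statistics import mean


def task_granularity(n_op, n_io_bytes):
    """g_t = n_op / n_I/O_byte, as in equation (2)."""
    return n_op / n_io_bytes


def cdfg_granularity(node_granularities):
    """g_t(CDFG) = largest node granularity, as in equation (3)."""
    return max(node_granularities)


# The example from the text: nine nodes of granularity 1, one of granularity 100.
nodes = [1] * 9 + [100]

g_max = cdfg_granularity(nodes)  # worst-case value used by the paper
g_avg = mean(nodes)              # the average the paper argues against

print(g_max, g_avg)
```

With these inputs the maximum is 100 while the average is 10.9, so a system chosen on the basis of the average would be sized for a demand roughly nine times smaller than that of the dominant node.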

When a hierarchical representation of a CDFG is used, the
change from the procedural level (i.e. the level in which nodes
represent routines) to the instruction level of detail (nodes
represent elementary instructions, e.g. basic C statements)
determines a decrease in the node granularity. In fact, if we
indicate with $n$ the size (in bytes) of the input/output
parameters, the number of operations $N\_OP(n)$ executed by the
routine, in most cases, grows faster than $O(n)$; in general
$N\_OP(n) \ge O(n)$. $N\_OP(n) = O(n)$ is a lower bound, $O(n)$
being the number of elementary operations necessary to
read/write input data (with the obvious exception of data
structures already stored in memory and communicated through a
pointer; however, in this case too the following inequality (4)
is satisfied). As a consequence, the law connecting the
granularity of a task to the size $n$ of input/output data is
given by

$$g_t(n) = \frac{N\_OP(n)}{O(n)} \ge O(1) \qquad (4)$$

At the instruction level the granularity is $O(1)$: the number
of bytes used to encode the input and output parameters of one
operation is a constant (with very few, and particular,
exceptions involving data movement), because elementary
operations manipulate one or two scalar values and return
another scalar value. As a consequence, when moving from the
procedural to the instruction level, task granularity does not
increase (it typically diminishes).
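
To make inequality (4) concrete, the sketch below compares two routines that are not taken from the paper and are used here only as assumed examples: a dense matrix product of two n-by-n matrices (2n^3 floating point operations on three matrices of n^2 words) and an element-wise vector addition (n operations on three vectors of n words). Words are assumed to be 8 bytes.

```python
def gt_matmul(n, word_bytes=8):
    """Task granularity of an n x n dense matrix product (assumed example)."""
    ops = 2 * n**3                      # n^3 multiplications plus n^3 additions
    io_bytes = 3 * n * n * word_bytes   # two input matrices and one output matrix
    return ops / io_bytes


def gt_vector_add(n, word_bytes=8):
    """Task granularity of an element-wise vector addition (assumed example)."""
    ops = n                             # one addition per element
    io_bytes = 3 * n * word_bytes       # two input vectors and one output vector
    return ops / io_bytes


for n in (12, 120, 1200):
    print(n, gt_matmul(n), gt_vector_add(n))
```

The matrix product has a granularity of n/12, which grows with the problem size, whereas the vector addition stays at the constant 1/24: it sits at the O(1) lower bound, just like a single elementary instruction.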

Other parameters characterizing nodes of a CDFG, at the
procedural level, are

- the type $T \in \{$'control-dominated',
'computation-dominated'$\}$; a node is control dominated when
it has small granularity and contains a number of decision
operations (i.e. conditional jumps) significantly larger than
the number of computing operations; on the contrary, a node is
computation dominated when it has large granularity;

- the computational paradigm; each node of the graph, when
expressed at a lower level of detail, can be represented by
means of the 'data-parallel', the 'pipeline', the 'farm', the
'loop' or the 'unrestricted' structuring constructs; for the
description of the structuring constructs, except the
'unrestricted' one, see [29]; the 'unrestricted' paradigm
refers to a generic computation represented by means of an
irregular CDFG.

We are now able to give the following definition of
heterogeneous task: a CDFG represents a heterogeneous task when

1. it is composed of more than one node at the 'procedural'
level and

2. its nodes are based on different computational paradigms or
have different types $T$ and/or

3. its nodes have different granularities.


4.  Matching  Task  and  System  Heterogeneity 
to  Maximize  System  Performances 

Having fixed the meaning of heterogeneity for tasks and
systems, it is important to evaluate their mutual relation and
to describe the associated heterogeneity parameters
(granularity, computational/architectural paradigms, node type
$T$). In this framework, it is worth investigating the
connections among system/task granularity, heterogeneity and
global performance.

The  granularity  G,  in  its  classical  form,  is  defined  as 
the  ratio  between  Run  time  (R)  and  Communication  time 
(C)  of  a  given  task  [28],  i.e. 


$$G = \frac{\text{task execution time}}{\text{task communication time}} = \frac{R}{C} \qquad (5)$$

To avoid an overly formal explanation, far beyond the scope of
this paper, we do not go into the details necessary to define
task execution and communication times; intuitively, we
consider as execution (communication) time the sum of all the
time intervals in which at least one computational unit (I/O
channel) is computing (communicating).

The previous definition of granularity is machine dependent,
since execution and communication times depend on processor
speeds and I/O bandwidths. The previously introduced
definitions of $g_m$ and $g_t$ can be used to make this
dependence explicit; in fact $G$ can be expressed as

$$G = \eta \frac{g_t}{g_m} \qquad (6)$$

$\eta = \eta_{proc} / \eta_{comm}$ being an efficiency figure
which takes into account the partial utilization of the
processor speed ($\eta_{proc}$) and of the bandwidth
($\eta_{comm}$). In order to verify the validity of (6), it is
sufficient to substitute in it the expressions of $g_t$ and
$g_m$ and, with a few algebraic operations, we obtain

$$\frac{\eta_{proc}\, g_t}{\eta_{comm}\, g_m} = \frac{\eta_{proc}}{\eta_{comm}} \cdot \frac{n\_op / n\_I/O\_byte}{PCS / BW} = \frac{\eta_{proc}}{\eta_{comm}} \cdot \frac{n\_op}{PCS} \cdot \frac{1}{n\_I/O\_byte / BW} = \frac{R}{C} \qquad (7)$$

The actual value of $\eta$ depends on the characteristics of
tasks and on their implementation on the physical system. A
reasonable estimate, to be confirmed through some experiments
on a given system, is $\eta$ between 0.1 and 0.5. For instance,
when dealing with tasks with large I/O packets (small
granularity), communication startup time is usually negligible
and $\eta_{comm} = 1$; in such a case $\eta = \eta_{proc}$ and
processor utilization in the range from 20% up to 60% is a
realistic figure.
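
As a numerical illustration of expression (6), the sketch below evaluates G for an assumed task. The function name and the task figures (g_t = 200, processor utilization 0.3) are hypothetical; the machine granularity 6.5 is the sub-system value computed earlier in the text.

```python
def classical_granularity(g_t, g_m, eta_proc=1.0, eta_comm=1.0):
    """G = eta * g_t / g_m with eta = eta_proc / eta_comm, following expression (6)."""
    eta = eta_proc / eta_comm
    return eta * g_t / g_m


# Hypothetical task with g_t = 200 operations per byte on a node with g_m = 6.5,
# large I/O packets (eta_comm = 1) and 30% processor utilization.
G = classical_granularity(g_t=200.0, g_m=6.5, eta_proc=0.3, eta_comm=1.0)
print(f"G = {G:.2f}")
```

The same task mapped onto a node with g_m = 100 would give G = 0.6, that is, below 1, which is why the choice of the target machine matters as much as the task itself.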

The expression of $G$ as the ratio between task and machine
granularity underlines how the relative values of task and
machine granularity are relevant to achieving high performance
when implementing the task on an actual (heterogeneous)
parallel system. Efficient task implementation requires
reconciling two conflicting requirements: maximum parallelism
(to minimize execution time) and minimum communication costs
(overheads). Knowledge of the granularity $G$ alone is not
sufficient to determine whether the implementation of the task
on a machine is efficient: $G$ gives just a measure of the
relative influence of communication overhead on system
performance. Given an implementation of a task on a parallel
system, efficiency reaches its maximum when, for a fixed degree
of parallelism, communication overhead is minimum. In fact,
indicating with $R$ the time spent executing computations and
with $C$ the time spent in communications, and defining as
efficiency the ratio

$$Eff = \frac{R}{\text{Actual Computing Time}}$$

where the Actual Computing Time is the elapsed time from the
start of the parallel program till its end, the following
inequalities hold:

$$Eff_{Min} = \frac{R}{R + C} \le Eff \le \frac{R}{MAX(R, C)} = Eff_{Max} \qquad (8)$$

which can be rewritten, introducing the granularity $G$, as

$$Eff_{Min} = \frac{1}{1 + \frac{1}{G}} \le Eff \le \frac{1}{MAX(1, \frac{1}{G})} = Eff_{Max} \qquad (9)$$

It is worthwhile to note that the fraction of unused
computational resources is given by $(1 - Eff)$. The lowest
value for the efficiency ($Eff_{Min}$) corresponds to the
complete absence of overlapping between computation and
communications; the highest value for the efficiency
($Eff_{Max}$) corresponds to a complete overlapping between
computations and communications. In figure 3 the values
$Eff_{Min}$ and $Eff_{Max}$ are sketched as functions of the
granularity $G$.


Figure 3: Minimum and maximum efficiency values vs Granularity

Actual efficiency values lie between the two plots, being
closer to the lower or to the higher one depending on the
algorithm structure and the HW support for overlapping
computation and communication (number of DMA channels, routing
processors).

$G = 1$ indicates equality between computation and
communication times. The larger $G$ is, the more negligible the
communication time is with respect to computing time. Values of
$G < 1$ give rise to I/O-bound problems.

In order to avoid situations with processors stalling due to
I/O operations, with a consequent strong decrease in
efficiency, the granularity of the task should be greater than
a certain value $G_0$, so that efficiency is greater than the
minimum acceptable threshold $Eff_T$. If the non-overlapping
model is assumed, the granularity must respect the following
inequality in order to have $Eff \ge Eff_T$,

$$G \ge \frac{Eff_T}{1 - Eff_T} \qquad (10)$$

and, making explicit the dependence of $G$ on $g_t$ and $g_m$,
we obtain

$$\frac{g_t}{g_m} \ge \frac{Eff_T}{1 - Eff_T} \cdot \frac{1}{\eta} \qquad (11)$$


From the previous inequality, setting

$$k = \frac{Eff_T}{1 - Eff_T} \cdot \frac{1}{\eta},$$

we obtain a fundamental relation between task and machine
granularity:

$$g_t \ge k \cdot g_m \qquad (12)$$


In the case of perfect overlapping between computation and
communications, it is easy to verify that expression (12),
starting from the condition $G \ge 1$, becomes

$$g_t \ge \frac{1}{\eta} g_m \qquad (13)$$


The previous expressions ensure a correct implementation of
tasks on heterogeneous systems. The value of $k$ has to be
estimated on the basis of 'reasonable' assumptions about the
degree of overlapping between computations and communications
(the expression of $k$ has been determined assuming the worst
case, with no overlapping) and about the efficiency $\eta$
which can be achieved when implementing the task on the system.
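
The efficiency bounds and the threshold factor k discussed above can be evaluated numerically. The following Python sketch is an illustration under stated assumptions: the function names are introduced here, and the target efficiency 0.8 and the value 0.5 for the efficiency figure are example inputs, not measurements from the paper.

```python
def efficiency_bounds(G):
    """Return (Eff_Min, Eff_Max) for a granularity G = R / C."""
    eff_min = 1.0 / (1.0 + 1.0 / G)    # no overlap: elapsed time is R + C
    eff_max = 1.0 / max(1.0, 1.0 / G)  # full overlap: elapsed time is max(R, C)
    return eff_min, eff_max


def min_granularity(eff_target):
    """Smallest G for which Eff_Min reaches eff_target (no-overlap model)."""
    return eff_target / (1.0 - eff_target)


def k_factor(eff_target, eta):
    """k such that g_t >= k * g_m guarantees eff_target without overlap."""
    return min_granularity(eff_target) / eta


print(efficiency_bounds(1.0))       # balanced computation and communication
print(min_granularity(0.8))         # G needed for 80% efficiency
print(k_factor(0.8, eta=0.5))       # resulting ratio required between g_t and g_m
```

Under these example inputs a task needs a granularity about 8 times larger than that of the machine node it runs on, so on a node with machine granularity 6.5 the task granularity should be at least about 52.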

The scheme to allocate a heterogeneous task onto a
heterogeneous system is the following:

Consider the highest levels of detail both for the
system and the task graphs;

Sort the task nodes in descending order of
computational complexity;

For each node in the task graph, taken in that
descending order, select the system nodes whose
architectural paradigm matches the computational
paradigm of the task node;

Among all the candidate system nodes, choose the
one which has the highest computational power and
satisfies the relation gt > k gm; as the choice of the
system depends on k, i.e. on the efficiency of the
implementation of the task node on the system node,
the process can be iterated at a lower level of detail
(i.e. the node is expanded, if possible, into a smaller
granularity CDFG and the system node is also
considered at a lower level of detail) until a
reasonable estimate for k is achieved;

Assign the task node to the system node found in the
previous point (choosing the system with the
highest computational power makes it possible to
satisfy the tasks with the highest computing requirements).

As the previous ‘recipe’ does not consider load
balancing, some policy must be chosen to avoid
overloading the most powerful systems; a method
could be based on a cyclic allocation policy or on some
dynamic updating of system performance (as a system
node becomes more loaded, its computational speed
appears smaller to the other task nodes that must still be
allocated). In order to take into account precedence
relations among nodes in the CDFG, the techniques discussed
in [23], [35] can be used.
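The allocation scheme and the dynamic load update described above can be sketched as a greedy procedure. This is only an illustration under stated assumptions: a single value of k is used for every task/node pair (the text lets k depend on the pair), the "apparent speed" rule is one hypothetical way to make a loaded node look slower, precedence relations are ignored, and all names and sample figures are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    paradigm: str      # computational paradigm, e.g. "SIMD" or "MIMD"
    complexity: float  # amount of work, in arbitrary but consistent units
    g_t: float         # task granularity


@dataclass
class SystemNode:
    name: str
    paradigm: str      # architectural paradigm
    speed: float       # peak speed, work units per second
    g_m: float         # machine granularity
    load: float = 0.0  # work already assigned to this node


def allocate(tasks, nodes, k=1.0):
    """Greedy allocation; returns {task name: node name, or None if no fit}."""
    mapping = {}
    # Heaviest tasks are placed first
    for task in sorted(tasks, key=lambda t: t.complexity, reverse=True):
        # Keep nodes with a matching paradigm that satisfy g_t > k * g_m
        candidates = [n for n in nodes
                      if n.paradigm == task.paradigm and task.g_t > k * n.g_m]
        if not candidates:
            mapping[task.name] = None
            continue

        def apparent_speed(node):
            # Hypothetical rule: queued work makes the node look slower
            return node.speed * task.complexity / (node.load + task.complexity)

        best = max(candidates, key=apparent_speed)
        best.load += task.complexity
        mapping[task.name] = best.name
    return mapping


# Illustrative figures only (speeds in Gflops, work in Gflop)
nodes = [SystemNode("ape512", "SIMD", 25.6, 1280.0),
         SystemNode("ape128", "SIMD", 6.4, 320.0),
         SystemNode("cs2", "MIMD", 0.18, 1.8)]
tasks = [Task("A", "SIMD", 100.0, 2000.0), Task("B", "SIMD", 20.0, 2000.0),
         Task("C", "MIMD", 10.0, 50.0), Task("D", "SIMD", 5.0, 100.0)]
result = allocate(tasks, nodes)
print(result)
```

In this example task B is steered to the smaller SIMD machine because the larger one is already loaded by task A, and task D is left unallocated because its granularity is too small for either SIMD node.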

5.  The  Heterogeneous  PQE1  System 

The previous discussion is aimed at stressing that a
heterogeneous system is not a mere collection of several
platforms occasionally used as a parallel system, but
an integrated system that must be designed from scratch
to behave as a heterogeneous parallel system. In fact,
heterogeneity is a property of the problems to be solved.
A ‘well balanced’ heterogeneous system will thus provide
the best way to solve complex ‘real’ problems.
Heterogeneity, moreover, avoids over-dimensioning a
parallel system, as the computational power is ‘dedicated’
(according to several computational paradigms), allowing
very high efficiency. The idea is to avoid, as much as
possible, the use of general purpose systems just because
they perform ‘quite well’ on all problems but not ‘very
well’ on any of them. On the contrary, heterogeneous
systems could contain different ‘dedicated’ parallel
systems, some very well suited for a certain
class of problems, others for different classes. In
this way, in principle, it would be possible to have a
system which often behaves ‘very well’ on many
problems, because different parts of a complex
application could be efficiently implemented on
architecturally different parts of the system. Furthermore,
on the basis of the previous analysis, specialized
architectures are often less costly (in terms of silicon
area, power consumption, volume) than general purpose
systems.

5.1  Rationale  for  the  PQE1  prototype 

General-purpose parallel machines support the Single
Program Multiple Data (SPMD) asynchronous
programming paradigm. Their HW structure is inherently
asynchronous, and some silicon area, as well as some
time, must be spent to manage process synchronization
and asynchronous communications. Such a waste of
resources can be avoided by using synchronous machines,
to which computational tasks requiring synchronous
algorithms can be efficiently allotted.

On the basis of the experience gained using SIMD
systems in several fields of technical-scientific computing
(material science [14][15], astrophysics [40], atmospheric
modeling [16], image processing and compression
[17][18], computational electromagnetics [19][20], linear
algebra [21], neural networks [22]) we are convinced that
the SIMD architectural paradigm can efficiently express
programs solving problems related to such fields.
Moreover, it is also preferable to the MIMD paradigm
because many algorithms

1.  are  synchronous; 

2.  often  require  that  all  the  processors  execute  the  same 
instructions  on  different  domains; 

3.  need  interprocessor  communications  executed  in  a 
synchronous  way; 

4.  do  not  need  deep  memory  hierarchies  thanks  to  the 
regular  patterns  of  memory  accesses. 

Points 1 and 3 show that, for such classes of algorithms,
the time spent in the synchronization phases required by
MIMD systems is a completely unnecessary overhead,
introduced by the asynchronous HW structure of the
machine. This overhead is not incurred by SIMD
machines with synchronous communications. Point 2
shows that all the HW dedicated to managing different
program flows in the processors is unnecessary, one
centralized controller of program flow being sufficient.
Point 4 means that cache memory and the related
management policies are not needed in most scientific
applications, since the ‘locality’ of the problem [30] is
easily controlled by the programmer through instructions
for vector movements between main memory and an
internal register file. Although cache memory is
particularly useful in multi-programmed environments,
where several processes are running and the fast memory
is not large enough to keep the whole image of all the
running processes, in most cases of scientific computing
only one process is running and its locality is easily
captured by the programmer through instructions which
allow burst memory transfers, through DMA channels,
between the slow external RAM and a fast internal
register file (or a multi-port/multi-bank internal static
memory). A further discussion of SIMD vs MIMD
architectures, along with a description of SIMD/MIMD
mixed mode systems, is reported in [27].

5.2  HW  description  of  the  PQE1  prototype 

The PQE1 is a ‘hybrid’ MIMD-SIMD platform where
the flexibility and operability of a MIMD (distributed
memory) architecture (the eight node Meiko/QSW CS-2)
are coupled to the power and efficiency of SIMD
machines (7 APE100/Quadrics systems), which
efficiently execute small granularity tasks.

If we take into account the 4 points listed above and
we assume that most algorithms arising in scientific
applications can be expressed through synchronous
programs with synchronous communications, executing
the same instruction on a set of different data which can
be easily mapped onto a data parallel structure with
regular patterns of memory access, it is quite
reasonable to allot those parts of the computation to the
APE100/Quadrics SIMD machine, leaving the remaining
tasks of the computation to be executed on the MIMD
part.

We used 7 APE100/Quadrics machines, built in 1994:
two with 512 processors arranged as an (8x8x8) 3D torus
and 5 with 128 processors arranged as an (8x4x4) 3D
torus. Each computing node is based on a custom VLIW
processor, has clock frequency fck=25 MHz and is able to
complete a ‘normal operation’ AxB+C every clock cycle,
so each processor executes two floating point operations
per clock cycle (when the pipeline is full) and has a
peak speed of 50 Mflops; floating point numbers are
represented according to the IEEE 754 standard (single
precision). Each node is connected to a data memory of
4 Mbytes and has an internal register file (RF) with 128
registers; each clock cycle the processor is able to read two
operands from RF and write one result to RF.
Communications with adjacent nodes, connected in the
north, south, east, west, up and down directions, are
synchronous and memory mapped; interprocessor
communication bandwidth is 12.5 Mbyte/sec, so the 512
(128) processor configuration has an aggregate bandwidth
of 6.4 (1.6) Gbyte/sec and a peak speed of 25.6 (6.4) Gflops.

The APE100/Quadrics machines have been connected to a
MIMD system, the Meiko/QSW CS-2 [24], to give more
flexibility to the SIMD machines. Each node of the MIMD
platform is based on two UltraSPARC processors
connected in SMP configuration, offers a peak speed of
180 Mflops and has 128 Mbytes of RAM. The connection
between CS-2 nodes and APE100/Quadrics systems is
implemented through a HiPPI (High Performance Parallel
Interface) channel, which provides a bandwidth of 20
Mbyte/sec. The nodes of the CS-2 machine are connected
via the Meiko/QSW proprietary network, based on the
Elan/Elite ASICs and implementing a multistage
interconnection network with fat-tree topology and
point-to-point bandwidth of 100 Mbyte/sec. The scheme
of the complete prototype is shown in Fig. 4.

The PQE1 hybrid system is thus composed of 7
SIMD machines, which provide an aggregate
computational speed of 83.2 Gflops, 20.8 Gbyte/sec of
bandwidth and 6.5 Gbytes of RAM. These parallel
systems communicate through 7 HiPPI channels with a
CS-2 machine, so the communication bandwidth
between the two systems is 140 Mbytes/sec. The CS-2
MIMD parallel system has 8 twin nodes, offers a peak
speed of 1 Gflops, has 1 Gbyte of RAM and has an
aggregate bandwidth of 800 Mbyte/sec.
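The aggregate SIMD figures follow directly from the per-node data. The short sketch below recomputes them; it assumes that the aggregate bandwidth counts one 12.5 Mbyte/sec link per processor and that Gbytes of RAM are counted in units of 1024 Mbytes (both readings reproduce the quoted totals). Constant and function names are illustrative.

```python
APE100_MACHINES = [512, 512, 128, 128, 128, 128, 128]  # processors per machine
CLOCK_MHZ = 25.0        # node clock frequency
FLOPS_PER_CYCLE = 2     # one AxB+C per cycle
LINK_MBYTE_S = 12.5     # interprocessor bandwidth per node
MEM_MBYTE = 4           # data memory per node
HIPPI_MBYTE_S = 20.0    # one HiPPI channel per SIMD machine


def machine_figures(procs):
    """Return (peak Gflops, bandwidth Gbyte/s, RAM Gbyte) for one machine."""
    peak_gflops = procs * CLOCK_MHZ * FLOPS_PER_CYCLE / 1000.0
    bandwidth_gbyte_s = procs * LINK_MBYTE_S / 1000.0
    ram_gbyte = procs * MEM_MBYTE / 1024.0
    return peak_gflops, bandwidth_gbyte_s, ram_gbyte


# Sum each column over the seven machines
totals = [sum(col) for col in zip(*(machine_figures(p) for p in APE100_MACHINES))]
hippi_total = len(APE100_MACHINES) * HIPPI_MBYTE_S

print([round(t, 1) for t in totals], hippi_total)
```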

Looking at the previous data, it is clear that the machine is
strongly unbalanced, having most of its computational
and communication speed in the SIMD part. If we analyze
the sub-unit composed of a CS-2 node and the attached
SIMD machine, seen as a co-processing system, the
resulting sub-unit granularity (at the system level) is

$$g_m(APE100\text{-}\#512) = \frac{25.6 \times 10^9}{20 \times 10^6} = 1280 \qquad (14)$$

and

$$g_m(APE100\text{-}\#128) = 320 \qquad (15)$$

The PQE1 system can be considered, at a first level, as
a parallel machine with small parallelism (parallelism
degree is 7). In order to avoid wasting performance, at
this level of parallelization we have to consider only very
large granularity tasks. If we consider, for instance, the
product of two (n x n) single precision matrices, the task
granularity is given by

$$g_t = \frac{2n^3}{12n^2} = \frac{n}{6}$$

($2n^3$ is the number of operations required, while $12n^2$
is the number of bytes to transfer, since the two input
matrices must be read and the result matrix written). In
order to avoid I/O bound behavior, $\eta g_t > g_m$ must hold;
in the case of a 128 processor machine, supposing
$\eta = 0.5$ (a reasonable value for this type of computation,
which uses sequences of dependent operations of the type
AxB+C), this corresponds to the condition n>3840.
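The matrix-product threshold can be checked with a few lines of arithmetic. The sketch below assumes 4-byte single precision elements and the 20 Mbyte/sec HiPPI bandwidth quoted for the CS-2/APE100 link; function names are illustrative.

```python
def matmul_granularity(n: int) -> float:
    """Flops per transferred byte for an (n x n) single precision product."""
    flops = 2 * n**3            # n^3 multiply-add pairs
    bytes_moved = 3 * 4 * n**2  # read two matrices, write one, 4 bytes each
    return flops / bytes_moved  # equals n / 6


def min_matrix_size(g_m: float, eta: float) -> float:
    """Value of n above which eta * g_t > g_m, i.e. n > 6 * g_m / eta."""
    return 6.0 * g_m / eta


g_m_128 = 6.4e9 / 20e6    # peak flops / HiPPI bandwidth, 128 processors
g_m_512 = 25.6e9 / 20e6   # same ratio for the 512 processor machine

print(min_matrix_size(g_m_128, 0.5))   # 3840.0
print(min_matrix_size(g_m_512, 0.5))   # 15360.0
```

Under the same assumptions, the 512 processor machine would require matrices four times larger in linear dimension before the HiPPI channel stops being the bottleneck.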

The second level of parallelism can be exploited within
the single task. The granularity of the SIMD machine (at the
sub-system level) is

$$g_m = \frac{25.6}{6.4} = \frac{6.4}{1.6} = 4 \qquad (16)$$


In  this  case  we  have  a  lot  of  parallelism  available 
(parallelism  degree  is  128  or  512)  and  we  can  deal  with 
small  granularity  tasks. 

As stated above, the rationale for such a strong
machine imbalance is that SIMD systems are very well
suited to implementing numerical computations, and can
reach very high sustained performance. The MIMD
nodes are not devoted to solving the ‘number crunching’
part of the problem, but to performing data pre/post
processing and to enabling communications among different
algorithms implemented on the SIMD systems. We
underline that typical sustained performance obtained on
the APE100/Quadrics machines ranges from 30% to 70%
of the peak performance, i.e. it varies from 7.7 to 18
Gflops on the 512 node machines.


5.3  SW  description  of  the  PQE1  prototype 

The basic way to program the PQE1 system is to use
a message passing paradigm (the MPI library) to
manage the large granularity tasks allocated to the
MIMD part. In order to allow a low-level interaction
between CS-2 nodes and SIMD machines, a
communication library has been devised and
implemented. This library contains a set of commands to
load and run programs on the SIMD machines, to
synchronize the execution of the program running
on the MIMD node with the program running on the
connected SIMD system, and to communicate data to and
from the SIMD system. Due to the large granularity of the
programs running on the SIMD nodes, no particular effort
has been spent to reduce start-up times which, for all the
operations, are of the order of 10 ms.

As the MIMD system is devoted to managing the whole
hybrid system and to increasing the flexibility of the PQE1
platform, a library implementing the functionality of a
Distributed Virtual Shared Memory (DVSM) was
developed [25]. This library allows physically
distributed memory areas to be declared as ‘shared’, so
that the user can operate on such areas with the usual
locking/unlocking operations, and it implements atomic
instructions to perform blocking/non-blocking read/write
operations with synchronized/unsynchronized access.
Typical times for locking (unlocking) an area are 60 (45) μs;
the time necessary to write (read) a page is 19 (50) μs.
These times do not depend on the size of the
memory area.

A further tool, called SkIE-CL [26], has been devised
and implemented to improve the programmability of the
PQE1 system. It is a skeleton based coordination language
which allows task/data parallelism to be expressed through
some predefined schemes (pipeline, farm, map, loop).
Once the program has been written using the available
parallel constructs, SkIE-CL is able to generate MPI code
to program the MIMD part of the machine, performing a
(near)-optimal mapping of tasks onto the MIMD part of the
system by using an analytical model of the constructs;
furthermore, SkIE-CL allows the SIMD systems to be
controlled by means of the communication library
described above.

These tools (the DVSM and SkIE-CL) were
jointly developed by QSW and the Information Science
Department of the University of Pisa.

Two interesting applications using PQE1 prototype
features, i.e. overlapping computations between the SIMD
and the MIMD parts of the system, can be found in [21]
and [16]. The first refers to the implementation of Basic
Linear Algebra Subroutines-3 on the SIMD part of the
system. The MIMD connections are used to perform a
block-partitioned matrix-matrix product, the products of
the sub-blocks being distributed among several SIMD
machines. The second work is related to the
implementation of a high resolution meteorological
limited area model coupled with an ocean model for the
prediction of the state of the Mediterranean Sea and of
high water events in the Venice Lagoon. The code was
parallelized by allotting the computation of the most time
consuming models (the High Resolution and the Very
High Resolution Limited Area Models) to the SIMD part
and the solution of the less computationally intensive
spectral wave model (WAM) to the MIMD nodes. These
nodes are also charged with the computation of the two
dimensional model (POM) for the prediction of the
Adriatic Sea circulation and, finally, of the finite element
shallow water model of the Venice Lagoon.

In  the  following  two  paragraphs  we  give  some  details 
on  the  implementations  and  the  results  achieved  when 
using  the  PQE1  system  to  perform  n-body  gravitational 
computations  and  electromagnetic  simulations. 

5.4  n-body  computations 

The PQE1 architecture has been recently used for
performing n-body ($O(N^2)$) calculations to study the
dynamic behavior of a galactic globular cluster hosting a
massive object (black hole) in its center [40]. Calculations
have been carried out by exploiting the double level of
parallelism which can be attained with the machine. The
first, related to the SIMD parallelization of the $O(N^2)$
loop, was obtained by partitioning the stellar positions
among the different nodes and by allotting the force
calculations on a given partition to a single SIMD
node. The hypersystolic loop ([36], [37]) has been
successfully used to reduce communication times within
the force loop calculation. The second level of parallelism
has been exploited by using the MIMD resources to
evaluate the black hole-star interactions ($O(N)$ loop)
during the time spent by the SIMD part evaluating the
interstellar interactions. The concurrent use of both the
MIMD and the SIMD parts made it possible to perform the
integration of one reference time (crossing time) of a
system of N=128000 stars in a CPU time of the order of
t=72500 sec (with the SIMD part consisting of a
platform with 512 nodes).
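The two loops that are split between the SIMD and MIMD parts can be illustrated with a serial direct-summation sketch. This is not the code of [40]: the hypersystolic communication scheme, the partitioning among nodes and the time integration are not reproduced, and the softening length `eps`, the unit gravitational constant and the function names are assumptions made for the example.

```python
import math


def pairwise_accelerations(pos, mass, eps=0.0, G=1.0):
    """O(N^2) loop: acceleration on each star due to all the other stars."""
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = [pos[j][c] - pos[i][c] for c in range(3)]
            r2 = d[0] ** 2 + d[1] ** 2 + d[2] ** 2 + eps * eps
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            for c in range(3):
                acc[i][c] += G * mass[j] * d[c] * inv_r3
    return acc


def central_object_accelerations(pos, bh_pos, bh_mass, eps=0.0, G=1.0):
    """O(N) loop: acceleration on each star due to the central massive object."""
    acc = []
    for p in pos:
        d = [bh_pos[c] - p[c] for c in range(3)]
        r2 = d[0] ** 2 + d[1] ** 2 + d[2] ** 2 + eps * eps
        inv_r3 = 1.0 / (r2 * math.sqrt(r2))
        acc.append([G * bh_mass * d[c] * inv_r3 for c in range(3)])
    return acc


stars = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
print(pairwise_accelerations(stars, [1.0, 1.0]))
print(central_object_accelerations([[2.0, 0.0, 0.0]], [0.0, 0.0, 0.0], 4.0))
```

Because the two functions share no intermediate data, they can run concurrently, which is the property exploited when the second loop is assigned to the MIMD nodes.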


5.5  Electromagnetic  simulation 


We investigated the simulation of the dynamic evolution
of electromagnetic fields through the integration of
Maxwell's equations by means of the Finite Difference in
the Time Domain (FDTD) scheme. A domain with
(n x n x n) cells was considered. Simulating one period of
the input signal requires Ns time steps. At the end of each
period of the simulation (i.e. at simulation time n+Ns,
n=0,1,...) in each cell the value

$$E_{max}(i,j,k) = \max_{t=1,\dots,N_s} \left( E^{n+t}_{i,j,k} \right)$$

is computed. These maximal values are then
sub-sampled with step s and communicated to the host to
be post-processed (for example reordered, normalized and
stored). In order to simulate an EM phenomenon with
frequency f=1.9 GHz on a domain with (n x n x n) cells,
we have chosen spatial discretization Δ=1.5 cm and
temporal discretization Δt=2.88x10^-11 sec to avoid
numerical and modal dispersion, so Ns=19. The number of
computations executed in one period is

$$N_{flops} = N_s \times 36 \times n^3 = 684 n^3 \qquad (17)$$

Setting the sub-sampling step s=5 (i.e. two samples per
wavelength are saved), at the end of each period the
number of bytes to be sent is given by

$$N_{bytes} = 4 \times \left( \frac{n}{s} \right)^3 = \frac{4}{125} n^3 \qquad (18)$$

According to (2), task granularity is given by

$$g_t = \frac{684 n^3}{\frac{4}{125} n^3} = 21675 \qquad (19)$$

From (6), (14) and (19), assuming an efficiency in the
implementation η=0.2, we obtain the granularity value
for the EM simulation executed on the 512 processor
APE100 system

$$G(APE\text{-}\#512) = \frac{\eta g_t}{g_m} = \frac{0.2 \times 21675}{1280} \approx 3.4$$

and, from (6), (15) and (19), the granularity value for the
execution on the 128 processor APE100 system

$$G(APE\text{-}\#128) = \frac{\eta g_t}{g_m} = \frac{0.2 \times 21675}{320} = 13.5$$


Since G>1 in both cases, the simulation
of one period of the EM phenomenon and the
communication of the sub-sampled results do not give rise
to an I/O bound problem.
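The chain of values used in this estimate can be recomputed as a check. The sketch below assumes 36 floating point operations per cell per time step and 4 bytes per transmitted sample, as stated in the text; names are illustrative. Evaluated directly, the ratio of (17) to (18) gives 684·125/4 = 21375, slightly below the printed 21675; either value yields G well above 1 on both machines, so the conclusion is unaffected.

```python
import math


def steps_per_period(freq_hz: float, dt_s: float) -> int:
    """Number of time steps needed to cover one period of the input signal."""
    return math.ceil(1.0 / (freq_hz * dt_s))


def fdtd_task_granularity(n_steps: int, s: int, flops_per_cell: int = 36) -> float:
    """Flops per byte sent to the host over one period (independent of n)."""
    flops_per_cell_period = n_steps * flops_per_cell   # 684 for Ns = 19
    return flops_per_cell_period * s**3 / 4.0          # one 4-byte sample every s^3 cells


Ns = steps_per_period(1.9e9, 2.88e-11)
g_t = fdtd_task_granularity(Ns, s=5)
G_512 = 0.2 * g_t / 1280.0
G_128 = 0.2 * g_t / 320.0

print(Ns, g_t, round(G_512, 2), round(G_128, 2))
```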

The second level of parallelism can be exploited within
the single FDTD task. In this case we have a lot of
parallelism available (parallelism degree is 128 or 512)
and we can deal with small granularity tasks. For
example, looking inside the structure of the parallel FDTD
simulation (described in [20]), $36 (n_c)^3$ is the number of
floating point operations executed in one time step within
a processor where $(n_c \times n_c \times n_c)$ cells have been
allocated and $2 \times 6 \times 4 (n_c)^2$ is the number of bytes to
communicate at each time step (3 faces with two of the
Ex, Ey, Ez components (depending on the face) and 3 faces
with two of the Hx, Hy, Hz components must be
communicated); in such a case task granularity is given by

$$g_t = \frac{36 (n_c)^3}{48 (n_c)^2} = \frac{3}{4} n_c \qquad (20)$$


In order to avoid an I/O bound problem, when
overlapping between communications and computations
is allowed, $g_t > \frac{1}{\eta} g_m$ must hold (eq. (13));
from (13), (16) and (20) we derive the condition

$$n_c > \frac{4}{3} \cdot \frac{4}{0.2} \approx 27$$

which gives the linear dimensions of the sub-domains
into which the global simulation domain is partitioned.
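The sub-domain condition can be verified numerically. This sketch assumes the per-processor counts given in the text (36 flops per cell, 2x6x4 bytes per face cell) together with g_m=4 and η=0.2; function names are illustrative.

```python
import math


def fdtd_node_granularity(nc: int) -> float:
    """Flops per communicated byte for an (nc x nc x nc) sub-domain."""
    flops = 36 * nc**3
    bytes_sent = 2 * 6 * 4 * nc**2
    return flops / bytes_sent   # equals 0.75 * nc


def min_subdomain_edge(g_m: float, eta: float) -> int:
    """Smallest integer nc with 0.75 * nc > g_m / eta."""
    bound = (4.0 / 3.0) * g_m / eta
    return math.floor(bound) + 1


nc_min = min_subdomain_edge(4.0, 0.2)
print(nc_min, fdtd_node_granularity(nc_min))
```

With 26 cells per edge the granularity is 19.5, just below the required value of 20, which is why 27 is the smallest admissible edge.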

The performance achieved in EM simulations was close
to the value η=0.1, which corresponds to a sustained
performance of 2.5 Gflops when using the PQE1 system
with one 512 node SIMD machine. This quite low figure
is due to the Absorbing Boundary Conditions (ABC), not
discussed above, which present a very low degree of
small granularity parallelism, thus diminishing the global
performance of the system.


6.  Next  generation  of  the  PQE1  hybrid 
prototype 

The very interesting results obtained with the hybrid
PQE1 prototype confirmed the validity of the approach of
coupling specialized massively parallel systems to general
purpose parallel machines. The PQE1 prototype
presented in this work is based on HW of a previous
technological generation: we are thus planning to design a
new system with up-to-date components, based on several
instances of the new APEmille SIMD parallel machine.
The MIMD part will consist, in accordance with recent
trends in parallel computing with large granularity
systems, of a Linux cluster connected through a
proprietary fast interconnection network. Furthermore,
the new prototype will allow the insertion of ad hoc
designed specialized systems, based on programmable HW
(e.g. FPGA).

One of the main novelties of the next generation
prototype, along with the technological improvements
which put it in the very high-end section of today's
supercomputers, lies in the possibility to apply and test
methodologies derived from the HW/SW co-design field.
In fact, the capability to implement some specific classes
of algorithms on programmable HW will allow, at compile
time and on the basis of some cost criteria, the choice
between SW or HW implementation of some nodes in the
CDFG specifying the application behavior.

6.1  The  APEmille  system 

APEmille, being the evolution of the
Quadrics/APE100 system, is a SIMD machine. The first
prototypes were built in 1999. Similarly to APE100,
APEmille has a 3D toroidal topology and uses custom
VLIW processors. Each processor, working at a clock
frequency of 66 MHz, is able to complete a ‘normal
operation’ AxB+C on complex numbers at every clock
cycle. As this amounts to 8 floating point operations per
cycle, the peak performance of an APEmille processor is
equal to 528 Mflops.

Each node has an internal register file with 512
32-bit locations and is equipped with 32 Mbytes of
Synchronous DRAM which can be accessed with a
bandwidth of 528 Mbyte/sec, thus resulting in a node
granularity, at the processor level, of $g_m = 1$.

Each node can access the memory of its neighbors in the 3
spatial directions with a bandwidth of 66 Mbyte/sec, so the
granularity of an APEmille machine with p processors, at the
sub-system level, is

$$g_m = \frac{p \cdot 528 \times 10^6}{p \cdot 66 \times 10^6} = 8.$$

The I/O is based on the use of one PCI channel for
each cluster of 32 computing nodes, thus resulting in a
granularity, at the system level,

$$g_m = \frac{32 \cdot 528 \times 10^6}{100 \times 10^6} \approx 170,$$

100 MByte/sec being the actual bandwidth measured on
the PCI channel.

The  largest  configuration  of  the  APEmille  is 
constituted  by  2048  nodes,  yielding  a  peak  speed 
exceeding  1  Tflops. 

Besides the improvements in processor and memory
access speeds, APEmille differs from APE100 in that
double precision and integer operations are provided in
the computing nodes.
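The three APEmille granularity levels and the peak speed of the largest configuration follow from the figures quoted above. The sketch below only redoes the arithmetic; variable names are illustrative.

```python
CLOCK_HZ = 66e6            # processor clock
FLOPS_PER_CYCLE = 8        # complex AxB+C per cycle
MEM_BYTES_S = 528e6        # local SDRAM bandwidth
LINK_BYTES_S = 66e6        # neighbor access bandwidth
PCI_BYTES_S = 100e6        # measured PCI bandwidth, one channel per 32 nodes
NODES_PER_PCI = 32
MAX_NODES = 2048

peak_flops = CLOCK_HZ * FLOPS_PER_CYCLE                 # 528 Mflops per node
g_processor = peak_flops / MEM_BYTES_S                  # processor level
g_subsystem = peak_flops / LINK_BYTES_S                 # sub-system level (p cancels)
g_system = NODES_PER_PCI * peak_flops / PCI_BYTES_S     # system level
peak_largest = MAX_NODES * peak_flops                   # largest configuration

print(g_processor, g_subsystem, round(g_system, 2), peak_largest / 1e12)
```

The system-level value evaluates to about 169, which the text rounds to 170; the largest configuration reaches about 1.08 Tflops.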

6.2  The  MIMD  system 

Following the evolution of high-end commodity
processors, the ALPHA EV6.7 has been chosen as
computing core because of its high computational
performance (1.3 Gflops).

The MIMD system will be based on 16 nodes, each
equipped with 1 Gbyte of DRAM. Each node consists of
two EV6.7 processors connected in SMP
configuration. Internal memory bandwidth is 2.6
Gbyte/sec, so the granularity at the node level is

$$g_m = \frac{2 \cdot 1.3 \times 10^9}{2.6 \times 10^9} = 1$$

The interconnection network uses QsNet, based on the
Elan III network adapter and the Elite III switch. QsNet
[31] has a fat-tree topology, as shown in figure 5 for a 128
node system, and offers a remote access latency of 2.5
μsec and a bandwidth of 210 Mbyte/sec. For a system
with p nodes, granularity at the interval level is

$$g_m = \frac{p \cdot 2.6 \times 10^9}{p \cdot 210 \times 10^6} = 12.4$$

Granularity at the system level has the same value,
because both interprocessor communications and I/O
operations are limited by the PCI speed.


An interesting comparison highlighting the better
performance of QsNet with respect to the Gigabit
Ethernet and Myrinet networks is reported in [41], where
the measured MPI latency and bandwidth are given. In
Table 1 we summarize these values.


Table 1: Network Comparisons

Network            Latency (μs)   Bandwidth (MB/s)
Fast Ethernet      50             12.5
Gigabit Ethernet   15             125
Myrinet            20             62
QsNet              5              200
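To see how latency and bandwidth combine, the values of Table 1 can be fed into a first-order cost model in which sending a message takes the latency plus the message size divided by the bandwidth. This model is an assumption added for illustration and is not taken from [41]; it also reads the latency column as microseconds and takes 1 MB as 10^6 bytes.

```python
# (latency in microseconds, bandwidth in MB/s), values from Table 1
NETWORKS = {
    "Fast Ethernet": (50.0, 12.5),
    "Gigabit Ethernet": (15.0, 125.0),
    "Myrinet": (20.0, 62.0),
    "QsNet": (5.0, 200.0),
}


def transfer_time_us(size_bytes: float, latency_us: float, bandwidth_mb_s: float) -> float:
    """First-order message time; 1 MB/s moves exactly 1 byte per microsecond."""
    return latency_us + size_bytes / bandwidth_mb_s


for name, (latency, bandwidth) in NETWORKS.items():
    print(f"{name}: {transfer_time_us(1000, latency, bandwidth):.1f} us for 1000 bytes")
```

For short messages the latency term dominates, so under this model the advantage of QsNet over Gigabit Ethernet is larger than the bandwidth ratio alone would suggest.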

6.3  Specialized  system  design 

In order to design specialized HW systems, we have
developed a High Level Synthesis (HLS) methodology
which, starting from a high level description of an affine
iterative algorithm, allows its automatic hardware
synthesis; the theoretical basis of this approach can be
found in [38], [39]. The HLS methodology is based on a
sequence of steps which transform the high level
description into several lower level representations, until
the hardware implementation (described through a
Hardware Description Language) is reached. Each
transformation step is correct-by-construction, i.e. it
preserves the application semantics, allowing the automatic
implementation of the HLS methodology. In order to
ensure the generation of correct-by-construction
transformation steps, the high level description of the
algorithm is given through a mathematical model of
computation. In this way each transformation step is
mathematically proved to be correct.

The chosen model of computation is the System of
Affine Recurrence Equations (SARE) ([32] [33] [34]),
which is one of the most promising models of computation
in fields such as signal and image processing,
linear algebra and scientific computing. The SARE
computational model allows the specification of an
algorithm by means of recurrence equations.
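As an illustration of what a specification by recurrence equations looks like, the matrix product can be written as a single recurrence over a three-dimensional index domain, C(i,j,k) = C(i,j,k-1) + A(i,k)·B(k,j), where every index on the right-hand side is an affine function of (i,j,k). The sketch below evaluates this recurrence directly. It is a generic textbook-style example, not the input format of the HLS flow described here, and the function name is hypothetical.

```python
def matmul_by_recurrence(A, B):
    """Evaluate C(i,j,k) = C(i,j,k-1) + A(i,k) * B(k,j) with C(i,j,0) = 0."""
    n = len(A)
    # C is defined on the domain 0 <= i, j < n and 0 <= k <= n
    C = [[[0.0] * (n + 1) for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(1, n + 1):
                # k runs from 1 to n; Python lists are 0-based, hence k - 1
                C[i][j][k] = C[i][j][k - 1] + A[i][k - 1] * B[k - 1][j]
    # The result is the last slice of the recurrence along k
    return [[C[i][j][n] for j in range(n)] for i in range(n)]


print(matmul_by_recurrence([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Each point of the index domain is computed once from already defined points, which is the property that makes such descriptions amenable to systematic space-time mapping onto hardware.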

7.  Conclusions 

In  this  work  a  brief  review  of  the  supercomputer 
scenario  has  been  presented,  discussing  advantages  of 
custom  vs  commodity  system  implementation. 

Some theoretical aspects involved in heterogeneous
system design and management have been introduced.
Particular emphasis has been devoted to the definition and
discussion of task and system granularity. After
underlining the impact of a correct matching between
task and system granularity, we presented some results
obtained in a scientific project aimed at exploiting the
advantages connected both to heterogeneity and to the use
of custom parallel architectures. The outcome of this
project was the PQE1 hybrid parallel system. After a brief
description of its HW and SW environment, some
examples of its use in several application domains have
been reported (simulation of the sea level in the Venice
lagoon, of the dynamics of a galactic globular cluster, of
electromagnetic field evolution).

Finally, on the basis of the experience gained while
developing this project, the HW/SW architecture of a next
hybrid parallel prototype has been briefly presented.
Abstract

In this paper, we present the results of the
NRW-Metacomputing taskforce, which has been working on the
development of a country-wide metacomputer since 1996.
The resulting installation is among the very few that are
already operational, have full support for heterogeneous
resources, contain a decent security model, and feature an
advanced scheduling sub-system for the metacomputing
environment. The NRW-Metacomputer has been implemented
using a modular software architecture. Hence, concepts
and components of it can be re-used by others without the
need of having to obtain the metacomputing-software as a
whole. Furthermore, the NRW-Metacomputer already
provides well defined interfaces for linking the system with
other metacomputing environments to form a truly global
computational grid. Distinctive features of this system are
its highly scalable and fault tolerant software architecture,
its advanced resource planning mechanisms as well as an
integration into a DCE/DFS environment.


1  Metacomputing  and  the  Computational 
Power  Grid 

The  term  metacomputer  was  initially  coined  by  Larry 
Smarr  around  1987  [13].  According  to  his  definition,  a 
metacomputer  is  a  network  of  globally  distributed  machines 
that  are  linked  together  by  a  complex  software  system 
which enables them to act like a single very large supercomputer.
The advantages of the concept are obvious. First of
all, a metacomputer can theoretically provide more computing
power than any existing single machine. Furthermore,
it offers its users a free choice of which machine(s) to use for a
specific job, and it helps the participating computing centers
to distribute the load more evenly.

Building a metacomputer, however, is a complex and
difficult task. Since the concept was invented, many researchers
have been working on the subject, and nowadays
there are several software systems, which all cover different
aspects of Smarr’s and Catlett’s vision. Among them are
advanced cluster management systems like CODINE [8],
LSF [15], or CONDOR [5], which focus mainly on connecting
Unix workstations. Furthermore, there are a couple
of projects which take a more general approach and
work on the integration of fully heterogeneous resources
such as supercomputers or remote data sources (e.g. satellites,
weather radar, unique scientific instruments). The
NRW-Metacomputer initiative, which is presented in this
paper, is one of these projects. Others are, e.g., GLOBUS
[6], LEGION [10], or MSHN [11].

Over time it became clear that a global metacomputer
which meets the definition of Smarr and Catlett and
is as easy to use as a single workstation is unlikely to be
established by any single research group. The large variety
of problems that have to be solved (resource management,
administration, accounting, security, scheduling, etc.)
requires close cooperation between researchers from many
different fields. This is why in the recent past a couple of
open forums have been established, which focus on the installation
of large computational power grids¹ by putting together
the pieces that have been invented during seven years
of metacomputing research all over the world [1,2].

1 A synonym for metacomputer which emphasizes the analogy to a
nationwide power grid.

In the following, we present the outcomes of the NRW-Metacomputing
task force [3], which has been working
on the development of a country-wide metacomputer since
1996. The resulting installation is among the very few that
are already operational, have full support for heterogeneous
resources, contain a decent security model, and feature
an advanced scheduling sub-system for the metacomputing
environment. The NRW-Metacomputer has been implemented
using a modular software architecture. Hence,
both concepts and components of it can be re-used by others
without the need to obtain the metacomputing
software as a whole. Furthermore, the NRW-Metacomputer
already provides well-defined interfaces for linking the system
with other metacomputing environments to form a truly
global computational grid.

2  Architecture  of  the  NRW-Metacomputer 

The backbone of the NRW-Metacomputer was built during
the HPCM (high performance computer management)
project, which provides basic infrastructure for metacomputing
management. It features a multi-tier architecture
in which a server layer receives user requests from access
modules (see Sec. 5) and forwards them to the attached resources,
for example supercomputers (Fig. 1). These resources
are encapsulated by so-called coupling modules that
implement an abstract service layer on top of the heterogeneous
hardware pool. Besides abstracting the available information
and access methods from the management, the
coupling modules interact with the locally installed management
system of the HPC component. Although the coupling
modules have to be adapted to every new kind of hardware,
each implementation is based on a generic frame that
already covers a significant amount of the required functionality.
So far, there exist implementations of coupling
modules for UNIX, CODINE [8], NQE [4], CCS [12] and
LoadLeveler. Additional modules, e.g. for LSF, can easily
be derived from the existing implementations.

One  of  the  major  goals  of  the  initiative  was  to  develop 
a  metacomputer  that  maintains  maximum  autonomy  for  the 
participating  service  centers.  Hence,  each  institute  is  free  to 
tailor  the  behavior  of  its  coupling  modules  to  comply  with 
the  local  policies.  For  example,  if  a  certain  service  shall 
be  made  available  to  the  metacomputer  only  when  the  local 
machines  are  lightly  loaded  or  the  remote  user  is  willing  to 
pay  an  extra  fee,  the  behavior  of  the  coupling  module  can 
be  adjusted  accordingly. 
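
The idea of a coupling module with a site-specific policy can be
illustrated with a minimal Python sketch. This is not the HPCM
implementation; the class names, the fields of RemoteRequest, and
the load threshold are assumptions made purely for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class RemoteRequest:
    """A job request arriving from the metacomputer (hypothetical fields)."""
    user: str
    pays_extra_fee: bool = False


class CouplingModule(ABC):
    """Abstract service interface hiding one local management system."""

    @abstractmethod
    def local_load(self) -> float:
        """Return the current load of the local machines (0.0 to 1.0)."""

    def accepts(self, request: RemoteRequest) -> bool:
        """Default local policy: accept every remote request."""
        return True

    def submit(self, request: RemoteRequest) -> str:
        """Forward the job to the local system if the policy allows it."""
        if not self.accepts(request):
            raise PermissionError("request rejected by local policy")
        return self._enqueue(request)

    @abstractmethod
    def _enqueue(self, request: RemoteRequest) -> str:
        """Hand the job to the local management system; return a job id."""


class LightLoadOrFeeModule(CouplingModule):
    """Assumed site policy: serve remote jobs only when the local machines
    are lightly loaded or the remote user pays an extra fee."""

    def __init__(self, load: float, threshold: float = 0.5):
        self._load = load
        self._threshold = threshold
        self._next_id = 0

    def local_load(self) -> float:
        return self._load

    def accepts(self, request: RemoteRequest) -> bool:
        return self.local_load() < self._threshold or request.pays_extra_fee

    def _enqueue(self, request: RemoteRequest) -> str:
        # Stand-in for a call into the real local batch system.
        self._next_id += 1
        return f"job-{self._next_id}"
```

A site would override only accepts() to express its own policy, while
the rest of the metacomputer keeps talking to the same abstract
interface.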

It is furthermore possible to attach completely different
types of services to the metacomputer by using the same
abstract service interface that defines the behavior of the
coupling modules. For example, the scheduling services
described in Sec. 3 integrate themselves into the metacomputer
through this interface. A similar approach could be used to
establish a link between the NRW-Metacomputer and, e.g.,
a metacomputing system in another country or even on a
different continent.

Communication links between the different components
of the metacomputer are established by the so-called communication
layer. This is a separate module that provides
secured communication services for both Java- and C++/C-based
components. Currently the communication layer uses
TCP sockets for message passing and the GSS API [9] for
security. However, these can be easily exchanged or even
mixed with other paradigms, if necessary.

Another important issue for any metacomputer installation
that needs to be brought to practical use is administration
and authentication of its users. Typically, service centers
have already set high standards for the management of
their local users and are not willing to compromise these by
installing the metacomputer software. Hence, we decided
to rely on the services provided by the Distributed Computing
Environment (DCE) [7], since this is a vendor-supported
product and already accepted and used by many computing
centers (see Sec. 4).

Much effort has been spent on designing the NRW-Metacomputer
as a highly reliable and fault-tolerant system.
Its architecture does not contain any single point of failure.
This was achieved by having the distributed environment
managed by the coordinated effort of the management
daemons. These daemons are all alike and none of
them performs any specific tasks that are not directly related
to the corresponding computing center. Hence, a failure at
one node can at most render unavailable the center where
the problem occurred. As long as there is one management
daemon alive, the NRW-Metacomputer remains available.

It should be noted that all management daemons actively
contribute to the operation of the whole system. There
are no shadow daemons that monitor the system operations
silently until a component fails and then take the place of that
component. As a consequence, the principle of cooperative
management not only increases fault tolerance but also
helps improve the overall system performance. Clearly,
this system architecture required some extra effort in specifying
transaction protocols and during the implementation.
The key concept here was to employ the technique of virtual
shared memory for the metacomputer (Fig. 2).

All information that refers to the global state of the metacomputer
(such as the lists of active jobs or available resources)
is stored in so-called shared objects. These are
object-oriented encapsulations of virtual shared memory
segments. Whenever the content of a shared object changes,
these modifications are transparently propagated to all other
remote instances of the same object. Internal methods of
the shared object class ensure that the data is kept in a consistent
state, even if parts of the metacomputer should fail
while remote instances of shared objects are being updated.
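
A minimal Python sketch may help to picture how a shared object
propagates changes. It is a toy model under strong assumptions:
the instances live in one process, updates are sent as complete
versioned snapshots, and concurrent writers are not handled. The
class and method names are hypothetical and do not reflect the
actual shared object class.

```python
import copy


class SharedObject:
    """One local instance of a replicated object (illustrative only)."""

    def __init__(self, name, data=None):
        self.name = name
        self.version = 0
        self._data = dict(data or {})
        self._peers = []  # other instances of the same object

    def connect(self, other):
        """Link two instances so that each propagates changes to the other."""
        self._peers.append(other)
        other._peers.append(self)

    def read(self, key):
        return self._data.get(key)

    def write(self, key, value):
        """Apply a change locally, then push a snapshot to every peer."""
        self.version += 1
        self._data[key] = value
        for peer in self._peers:
            peer._apply(self.version, copy.deepcopy(self._data))

    def _apply(self, version, snapshot):
        # Only a complete, newer snapshot replaces the local state, so an
        # interrupted propagation never leaves a half-updated instance.
        if version > self.version:
            self.version = version
            self._data = snapshot
```

In this model an instance that misses an update simply keeps its
last complete state, which mirrors the consistency requirement
described above in a very simplified form.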

The use of shared objects enables each management
daemon to accept incoming requests and coordinate their
fulfillment by underlying metacomputing services. If one
management daemon should fail, the remaining ones take
over its tasks. The only effect visible to the users is that the
state of their jobs is reverted to the last completed transaction.
In most cases, this equals the current state of a job.

Figure 1. Layout of the management infrastructure

3  Integrated  Metacomputer-Scheduling 

Job scheduling and resource allocation are among the
core problems in the metacomputing architecture. The owners
of HPC installations are only willing to include their
resources into a metacomputer if the performance of their
components will not degrade. Similarly, users expect a better
performance for their jobs. Note that the expression performance
has not been defined, as different people may attach
a different meaning to it. Therefore, the scheduler must
provide a high efficiency for the metacomputer while also
taking additional requirements into account.

3.1  Scheduling  Considerations 

The paradigms for scheduling on a metacomputer differ
significantly from job scheduling on a parallel computer.
Therefore, we give in the following some properties
of meta-scheduling that must be considered for building a
computational power grid:

Variable Scheduling Objectives: In common job
scheduling there usually exists a single scheduling objective
or performance metric that is fixed for a parallel computer
and all its jobs. For example, this can be the minimization
of the average response or turnaround time [?]. The objective
is typically determined by the local management system
or by the administrator. In metacomputing this objective is
variable. As we assume a distributed system that is not controlled
by a single instance, the objective should further be
adaptable for each resource in the metasystem. While the
scheduling target for some machines may, for instance, be the
maximization of the throughput, others have the objective
to minimize the response time. Besides the objectives for
the resources, we must also take the needs of the user into
account. Some users may favor the availability of specific
resource properties like the size of the main memory, while
others may have additional constraints on the execution of
a job. A typical example would be a deadline for a job that
must be met, while it is of no particular interest whether the job is
completed as fast as possible. For this user the minimization
of the response time would not reflect her demands. In
metacomputing it is necessary that user objectives are considered.
For instance, the user may only be interested in
resources that fit her needs better than any local resources.
Therefore, the scheduling must be adaptable to generate the
most appropriate result.

Figure 2. The NRW-Metacomputer uses virtual shared memory for its distributed management

Independent Schedulers: Usually a scheduler in metacomputing
cannot demand exclusive control over all resources.
For scheduling in metasystems, we have to cope
with the situation that jobs may not only be submitted via
the metacomputing interface. Hence, any limitations of the
local management must also be considered. For instance,
one problem is the list scheduling of most management systems,
which do not provide any information about the expected
completion time of a job. Unfortunately, availability
of this kind of information is important in distributed metasystems
to allow future allocation planning. Of course, if
the local management provides additional information, the
metacomputing scheduler should be able to utilize it. In this
case the metacomputing management does not perform any
local scheduling but relies on the existing system scheduler.
The resulting schedule efficiency highly depends on the features
of the lower-level scheduler. If a resource does not
provide the requested features, like a guaranteed completion
time, it cannot be considered suitable for some job requests.
This limits the usability of this resource for the metasystem.
Nevertheless, the metacomputing scheduling should
support all kinds of local management systems.

Arbitrary Resource Requests: As job requirements
and resources in a metasystem may vary according to type
and application, there is a need for the description of complex
requests. For instance, assume two different users:
The first user does not provide a very detailed request as
she wants to get as many computing resources as possible.
More restrictive requirements would only reduce the possible
resource set for her job. The other user is looking for
very specific resources. He may have access to an alternative
set of local resources for the execution of his job and
is therefore only looking for a better resource allocation.
Consequently, he formulates special requirements and preferences.
The meta-scheduler must support both approaches.
The individual user should be able to influence the resource
selection and the scheduling so that she gets the best suited
set of resources. The attributes of a resource and therefore
the available fields in a request should not be considered
invariant. Different resources may have different attributes
and features that may not be known to the scheduler at the
time of implementation. But the system should still be able
to handle them.
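
One way to picture requests over an open set of attributes is the
following Python sketch. It assumes, purely for illustration, that
a resource is a dictionary of attributes and a request maps
attribute names to predicates; neither the attribute names nor this
representation come from the NRW-Metacomputer itself.

```python
def matches(resource: dict, requirements: dict) -> bool:
    """Return True if the resource satisfies every stated requirement.

    Attributes the request does not mention are ignored, so resources
    may carry attributes that were unknown when this code was written.
    """
    for attribute, predicate in requirements.items():
        if attribute not in resource or not predicate(resource[attribute]):
            return False
    return True


def select(resources: list, requirements: dict) -> list:
    """Return the names of all resources that match the request."""
    return [r["name"] for r in resources if matches(r, requirements)]


# Hypothetical resource descriptions with differing attribute sets.
resources = [
    {"name": "cluster-a", "cpus": 64, "memory_gb": 32},
    {"name": "vector-b", "cpus": 16, "memory_gb": 128, "vector_unit": True},
]

# First user: a loose request that keeps the candidate set large.
loose_request = {"cpus": lambda n: n >= 8}

# Second user: very specific requirements.
strict_request = {
    "memory_gb": lambda m: m >= 64,
    "vector_unit": lambda v: v is True,
}
```

The loose request matches both resources, while the strict request
matches only the resource that advertises the extra attribute.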

Resource Reservation: This feature is necessary for
some applications as well as for the consideration of resource
maintenance. For instance, demonstrations may require
the reservation of a resource allocation for a dedicated
time span. It is also advantageous for the schedule to consider
system downtime or restricted usage that is known in
advance. Reservations are further needed for multi-site applications.
As there is no global scheduler instance, it must
be possible for the local scheduler to reserve resources for
a specific time span in order to guarantee the concurrent
availability of resources at different locations.
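
The following Python sketch illustrates reservations for a time
span and their use for concurrent availability at several sites.
The ReservationCalendar class, the node-count model, and the
co_reserve helper are assumptions for illustration; real local
schedulers track resources in far more detail.

```python
class ReservationCalendar:
    """Tracks node reservations of one site over half-open spans [start, end)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._reservations = []  # list of (start, end, nodes)

    def free_nodes(self, start: float, end: float) -> int:
        """Conservative estimate: subtract every reservation overlapping the
        span, even if those reservations do not overlap each other."""
        used = sum(n for s, e, n in self._reservations if s < end and start < e)
        return self.capacity - used

    def reserve(self, start: float, end: float, nodes: int) -> bool:
        """Reserve nodes for the span; return False if not enough are free."""
        if start >= end:
            raise ValueError("start must be before end")
        if nodes > self.free_nodes(start, end):
            return False
        self._reservations.append((start, end, nodes))
        return True


def co_reserve(calendars, start, end, nodes_each) -> bool:
    """Reserve the same span at every site, or at none of them."""
    if all(c.free_nodes(start, end) >= nodes_each for c in calendars):
        for c in calendars:
            c.reserve(start, end, nodes_each)
        return True
    return False
```

Known downtime can be modelled in the same way by reserving the full
capacity of a site for the maintenance window.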

Job Execution Guarantees: In metacomputing it would
be inefficient to schedule jobs with an ad hoc strategy, as it
is difficult to respect several objectives without assuming a
central scheduler. For example, if a job does not need to be
executed as soon as possible, this flexibility can be used to
improve the schedule. Assume again the mentioned case of
a job with an execution deadline. Typically, the user needs
immediate feedback on whether his requirements can be met.
It is therefore necessary for the user to receive guarantees
about the schedule of his job in advance so that he can react
accordingly. The scheduler need not always provide such
guarantees, but it should be capable of giving them if they
are required. Those guarantees are additional constraints in
the scheduling of a job.

Figure 3. Logical Scheduling Infrastructure

3.2  Scheduling  Architecture 

To avoid the bottleneck of a central scheduler and to increase
flexibility, a distributed approach is employed. To implement
a distributed metacomputing scheduler we use an
architecture which is based upon so-called MetaDomains.
All MetaDomains of a metacomputer form a redundant network.
Typically, a MetaDomain is associated with local
HPC resources and is controlled by a MetaManager; that is,
all HPC resources at one site are connected to a single MetaDomain.
We use these autonomous scheduling domains to
manage the local resources and offer them to other domains.
This concept supports the idea of a computational power
grid, where several independent sites can join a larger network
and share jobs and computing power while not losing
control over the local resources. The logical structure
of such a scheduler is shown in Fig. 3. This network can
be dynamically extended or altered. The presented architecture
guarantees a high degree of flexibility by allowing
different implementations.

The MetaDomains communicate with each other by
transmitting or requesting information about resources and
jobs. To this end a MetaManager queries local schedulers
about system load and job status. A MetaManager can also
allocate local HPC resources to requests. The distributed
scheduling itself is based upon a brokerage and trading concept
which is executed between the MetaManagers.

In  detail,  a  MetaDomain  tries  to 

•  satisfy  local  demand  if  possible, 

•  ask other MetaDomains for resources, if the local demand
cannot be satisfied,

•  offer  local  HPC  resources  to  other  MetaDomains  for 
suitable  remote  jobs,  and 

•  act  as  an  intermediary  for  remote  requests. 
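
The behavior listed above can be sketched in Python as follows.
This is a strongly simplified model, not the trading protocol of
the NRW-Metacomputer: a request is reduced to a node count, every
MetaManager knows only its direct neighbours, and the first
suitable offer is taken. All names are hypothetical.

```python
class MetaManager:
    """Controls one MetaDomain with a pool of free nodes (toy model)."""

    def __init__(self, name: str, free_nodes: int):
        self.name = name
        self.free_nodes = free_nodes
        self.neighbours = []  # directly known MetaManagers

    def offer(self, nodes: int):
        """Offer local resources to a job if enough nodes are free."""
        if nodes <= self.free_nodes:
            self.free_nodes -= nodes
            return self.name
        return None

    def place(self, nodes: int, visited=None):
        """Return the name of the domain that takes the job, or None."""
        visited = set() if visited is None else visited
        visited.add(self.name)

        # Satisfy the demand locally if possible.
        site = self.offer(nodes)
        if site is not None:
            return site

        # Otherwise ask other MetaDomains; each of them may in turn act
        # as an intermediary and forward the request to its neighbours.
        for neighbour in self.neighbours:
            if neighbour.name not in visited:
                site = neighbour.place(nodes, visited)
                if site is not None:
                    return site
        return None
```

The visited set prevents a request from circulating forever in a
redundant network of MetaDomains.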

Once a suitable allocation of HPC resources (including
network resources) to a job has been found, the actual submission
is independent of the scheduler. In our architecture
the scheduling objectives are not specified. As already mentioned,
there may not only be a single scheduling objective in
a metacomputer. Each HPC component can define its specific
objectives. Similarly, each user may associate specific
constraints with his job, like a deadline or a cost limit. It is
the task of the trading system to find matches between requests
and offers. This way not all users and all components
are forced to fit into a single framework, as is usually done
in conventional scheduling. Now, it is their responsibility
to define their own objectives. The implementation of the
metacomputing scheduler only provides the framework for
such a definition, and it must be able to compare any request
with any offer to find a match.


The selection of the best suited allocation is based on a
comparison of the provided objective functions. The objective
function of a request is applied to an allocation in order
to generate a value for the utility from the user’s point of
view. Similarly, each offer also provides an objective function
or a utility value to represent the resource’s point
of view. Now, the responsible MetaManager combines the
objectives and determines an allocation that maximizes the
overall objective with respect to its full schedule. It is also
possible to provide the user with a front-end that allows interactive
selection of allocations. Such a front-end can also
be used to obtain status information about the metasystem
with the help of the request mechanism. This information
may help a user to generate a request which results in the
best suited set of resources depending on the current condition
in the metasystem.
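
As an illustration of combining objective functions, consider the
following Python sketch. The text does not specify how the user's
and the resource's objectives are combined; the sketch simply
assumes a sum of both utility values. The Allocation fields and the
example utility functions are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Allocation:
    """A candidate allocation offered for a request (illustrative fields)."""
    site: str
    start: float
    runtime: float
    cost: float

    @property
    def finish(self) -> float:
        return self.start + self.runtime


def deadline_user(deadline: float, budget: float):
    """User objective: reject late or too expensive offers, prefer cheap ones."""
    def utility(a: Allocation) -> float:
        if a.finish > deadline or a.cost > budget:
            return float("-inf")
        return budget - a.cost
    return utility


def best_allocation(candidates, user_utility, site_utility):
    """Pick the allocation with the highest combined utility.

    Assumption: the combined objective is the sum of both utilities.
    Returns None if no candidate is acceptable to the user.
    """
    scored = [(user_utility(a) + site_utility(a), a) for a in candidates]
    scored = [pair for pair in scored if pair[0] > float("-inf")]
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[0])[1]


# Hypothetical site preference, e.g. how well a job fills idle periods.
site_preference = {"site-a": 1.0, "site-b": 3.0}
candidates = [
    Allocation("site-a", start=0, runtime=5, cost=10),
    Allocation("site-b", start=2, runtime=5, cost=11),
    Allocation("site-b", start=20, runtime=5, cost=1),
]
chosen = best_allocation(
    candidates,
    deadline_user(deadline=10, budget=20),
    lambda a: site_preference[a.site],
)
```

Here the cheapest offer misses the deadline, and the slightly more
expensive offer at site-b wins because the site's own utility tips
the combined value in its favor.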

Note that our method is not an auction system, as we
do not provide a market where several jobs compete for a
resource. Instead our schedulers select allocations that fit
a request best at a particular point in time. However, the
selected allocation need not necessarily be executed. The
MetaManager maintains a schedule with all current allocations
in its domain. Its scheduler is free to modify the current
schedule at any time. However, changes in the current
schedule are only allowed as long as they do not violate
any guarantees that have been given for a job. Requested
guarantees are additional constraints that limit future requests
for rescheduling. This procedure is used to improve
the current schedule and to cope with resource failures or
cancellations of jobs. The rescheduling requires new requests
for offers if other allocations are not active anymore.
Note that a valid schedule exists at every moment. Also,
there is a tentative schedule for a job after each request and
a following allocation.
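
The rule that a schedule may change only while all guarantees
remain intact can be sketched in Python. The representation below
(a schedule as a mapping from job to start and end time, a
guarantee as a latest allowed end time, and a makespan-based
objective) is an assumption chosen to keep the example small.

```python
def violated_guarantees(schedule: dict, guarantees: dict) -> list:
    """Return the jobs whose guaranteed latest end time is not met.

    schedule:   job id -> (start, end)
    guarantees: job id -> latest allowed end time
    """
    return sorted(
        job
        for job, latest_end in guarantees.items()
        if job not in schedule or schedule[job][1] > latest_end
    )


def reschedule(current: dict, proposed: dict, guarantees: dict, objective):
    """Adopt the proposed schedule only if it keeps every guarantee and
    scores strictly higher than the current schedule."""
    if violated_guarantees(proposed, guarantees):
        return current
    return proposed if objective(proposed) > objective(current) else current


def negative_makespan(schedule: dict) -> float:
    # Hypothetical objective: an earlier overall finish scores higher.
    return -max(end for _, end in schedule.values())
```

Because a proposal is either adopted as a whole or discarded, a
valid schedule exists at every moment in this model as well.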

Any improvement of the schedule is measured by combining
all objective values. To this end, the scheduler attempts
to maximize the overall objective value of the schedule.
As an objective function is received for every request
and for every offer, there is a combined objective function
or value for each allocation. The objective functions of all
allocations together define the optimization problem. An
improvement can be achieved, for instance, by moving existing
allocations while all constraints are observed. Alternatively,
the scheduler can look for new allocations. The meta-scheduling
concept further supports multi-site scheduling
and co-allocation. However, this requires the inclusion
of network management as just another high performance
computing resource to provide guaranteed communication
bandwidth between participating resources. In addition, the
local resource managers must provide offers with schedule
guarantees which must be exactly met. Note that this
scheduling strategy does not guarantee an optimal schedule
in general, but it meets all requirements of Sec. 3.1, as separate
objectives are allowed for each resource in the metasystem.

In our metacomputer scheduling concept only the local
HPC scheduler is responsible for the load distribution on
the corresponding HPC resource. Therefore, it can also accept
jobs from sources other than the metacomputer. The
metacomputer scheduler only addresses the load imbalance
between different HPC resources. However, to execute
multi-site applications, the concurrent availability of different
HPC resources and sufficient network bandwidth between
them becomes necessary. For reasons of efficiency
this requires resource reservation for future time frames and
the concept of guaranteed availability. Although most HPC
schedulers do not presently support such an approach, it can
be implemented by using preemption (a checkpoint-restart
facility) while still maintaining a high system load.

In the project SCHEDULE [14] of the initiative a metacomputer
scheduler was designed using CORBA to allow
transparent and language-independent access to distributed
management instances. For the evaluation of different
scheduling methods a simulation framework has further
been implemented. It is used to compare different scheduling
algorithms regarding their applicability for a metacomputing
network. The benefit of possible technology enhancements,
for example preemption, to the quality of
the schedule is also determined with the help of the simulator.
As already mentioned, communication between resources
during a multi-site job execution must be taken into
account as well. To this end the available network must
be considered as a limited resource that is managed by the
schedulers in the MetaDomains. The inclusion of this objective
into the scheduler is part of the future work.

4  Data Distribution, Security and Administration

Like every metacomputing system that is brought to
practical use, the NRW-Metacomputer has to cope with a
variety of problems. Though it runs on a large number of
different hardware architectures, it must still provide a standard
API for the HPC components and other related software.
Furthermore, it needs a secure means of communication
across the public and inherently insecure Internet. It
must also provide scalability for its servers and the necessary
means of authentication. And finally, it must not impose
additional overhead on its users and the administrators
of the attached HPC nodes.

Hence, it was decided to use the standardized Distributed
Computing Environment (DCE) as an existing and reliable
software solution. DCE/DFS is a middleware, created by
the OSF and available under license for commercial use. Currently
we are using versions 1.0 and 1.1 of DCE/DFS on
Solaris, AIX and NT. Versions for IRIX and HP-UX were
tested. This approach provides us with the possibility to
include any DCE/DFS-capable system into the metacomputer.
There is also a project to create a free DCE/DFS port
for Linux, but in its early state and without the DFS it is not
usable for the NRW-Metacomputer.

Figure 4. Screenshot of the Java User Interface

An  administrative  domain  built  on  DCE  is  called  a  cell. 
DCE  provides  a  namespace  for  every  cell  that  allows  easy 
access  to  all  its  resources.  Cells  can  be  connected  using 
cross  cell  authentication  to  provide  a  secure  way  for  sharing 
resources  and  to  minimize  administrative  overhead.  Thus, 
users  can  authenticate  to  any  client  within  the  DCE  cell, 
submit  jobs  or  use  the  provided  distributed  file  system  DFS. 

Since  all  computers  within  the  metacomputer  are  part 
of  a  DCE  cell,  the  administrative  overhead  is  minimized 
and  can  be  distributed  throughout  the  metacomputer.  This 
is  possible  because  DCE  uses  access  control  lists  (ACL) 
for  administrative  commands  and  functions,  which  allow 
privileges  to  be  selectively  granted  to  either  individuals  or 
groups.  With  these  ACLs  it  is  for  example  possible  to  create 
administrative  accounts  for  special  tasks,  such  as  creating 
new  users  or  incorporating  new  clients,  without  the  need  to 
have  full  root  access  to  the  entire  cell. 

Communication within the DCE cell is secured by using
encrypted RPC calls. Because of the restrictions on exporting
encryption algorithms from the USA, the international version
of DCE/DFS does not have this security. Hence, we
decided to use the GSS API for the HPC components to protect
communication across the public Internet.

The use of DFS on top of the DCE environment offers
several advantages for the NRW-Metacomputer. All users
have a home directory, which they can access independently
of their physical location, the compute servers can instantly
access the required input and output files, and installed
software packages are available to the whole metacomputer.
DFS uses the DCE ACLs to offer a high level
of security and flexibility for file or data access. It is possible
to create groups whose members have the same set of permissions,
which lowers the administrative workload. The actual files
within the DFS are stored in filesets, which can be compared
to Unix filesystems or DOS partitions. These filesets
provide key features for a distributed environment. Several
servers can host a specific fileset to provide scalability
and stability. The DFS server keeps these so-called replicas
synchronized with the original fileset. This allows fast and
up-to-date access to HPC software from all clients within
the NRW-Metacomputer. Filesets can be backed up during
normal system operations, and they can be enlarged or even relocated
from one DFS server to another without the need
to stop the working software, which is important for long-running
jobs. Given these facts, a centralized backup for
all data within the cell is possible and is used for the NRW-Metacomputer.

Figure 5. Screenshot of the Java Administration Interface

For the special needs of the NRW-Metacomputer the
DCE internal database (registry) is used to store additional
information such as user accounts. It is planned to expand
this, so that the registry will contain all information
needed to run the metacomputer. For instance, this may be
information about special needs or restrictions of connected
compute servers or software. Using the registry offers several
advantages, because all its data is accessible within the
whole cell via the DCE API or online commands, and the
database itself is scalable and fault tolerant.

Since DCE provides an API, software such as sendmail,
the Apache web server, or Samba can be adapted to use its
special features. Furthermore, it is possible to incorporate
DCE into any other proprietary software. DCE has a standard
command line interface, which is called dcecp, the DCE
control program. By using this interface in batch mode,
it is possible to write different kinds of administrative software,
such as a front end using HTML and CGI scripts for
web servers, or an X11 interface using Tcl/Tk, without the
need to directly use the API or other low-level functions
of DCE. The dcecp itself is written using Tcl, which allows
new modules to be developed for it.

5  Accessing  the  Metacomputer  via  the  World 
Wide  Web 

An important aspect of the idea of a computational
power grid is the freedom for its users to access the system
from wherever they want, ideally even from a hotel room in
a remote corner of the world. Hence, we decided to use Java
and WWW technology for implementing the user interface of
the NRW-Metacomputer (Fig. 4). In order to provide the same
comfort to the administrators of the metacomputer, the
administration interface has been implemented in the same
way, thereby enabling both use and administration of the
metacomputer from wherever there is a connection to the
Internet (Fig. 5). Furthermore, both interfaces are capable
of storing personal customization data within the
metacomputer. Thus, users always find their own personal
access interface, no matter from where the system is being
accessed.

Figure 6. Current size of the NRW-Metacomputer

The metacomputer performs its own user management,
authentication and authorization based on the DCE
infrastructure. Thus, users only have to log in once for
each metacomputing session. If the machine used for
accessing the system runs a DCE client, users can even work
with their private files independently of their current
location. This is therefore the recommended way of working
with the NRW-Metacomputer. However, if the DCE services are
not available, user files can be transferred to and from the
metacomputer by a set of ftp-based services, which we have
added to the user interface (Fig. 1).

6  Conclusions 

We have presented the NRW-Metacomputer as a working
connection of four computing centers spread all over
North Rhine-Westphalia, which contains heterogeneous HPC
resources such as Cray T3E, IBM SP2, Sun Enterprise, and
Siemens hpcLine (Fig. 6). Among the important aspects of
this system are the modular, multi-tier architecture, its
powerful scheduling component, and the integration into a
DCE-based environment. Knowing that there exist several
other metacomputing environments with similar capabilities,
we described how these projects can benefit from the
results of the NRW-Metacomputer initiative. Possible
aspects are the re-use of concepts or modules and the
creation of a large metacomputing environment by using the
interfaces of the NRW-Metacomputer.

Besides the integration of more sites into the
metacomputer, future work will also focus on the
development and evaluation of improved scheduling
strategies on top of the now existing infrastructure.
Furthermore, we will employ the service interface of the
NRW-Metacomputer to add new kinds of services, for example
streaming video or other real-time data sources.
Abstract

In  this  paper  we  present  a  distributed  discovery  method 
allowing  individual  nodes  to  gather  information  about  re¬ 
sources  in  a  wide-area  distributed  system  made  up  of  au¬ 
tonomous  systems  linked  together  by  a  network  technology 
substrate.  We  introduce  an  algorithm  and  a  model  for  dis¬ 
tributed  awareness  and  a  framework  for  dynamic  assem¬ 
bly  of  agents  monitoring  network  resources.  Whenever  an 
agent  needs  detailed  information  about  individual  compo¬ 
nents  of  another  system  it  uses  the  information  gathered  by 
the  distributed  awareness  mechanism  to  identify  the  target 
system,  then  creates  a  description  of  a  monitoring  agent  ca¬ 
pable  of  providing  the  information  about  remote  resources, 
and sends this description to the remote site. There an agent
factory dynamically assembles the monitoring agent. This
solution is scalable and suitable for heterogeneous
environments where the architecture and the hardware
resources of individual nodes differ, the services provided
by the system are diverse, and the bandwidth and the latency
of the communication links cover a broad range.


1.  Introduction 

In this paper we address the problem of resource discovery
in a wide-area distributed system made up of autonomous
systems linked together by a network technology substrate.
The system is heterogeneous: the architecture and the
hardware resources of individual nodes differ, the services
provided by the system are diverse, and the bandwidth and
the latency of the communication links cover a broad range.

Individual nodes in such a distributed system may cooperate
to accomplish tasks that require resources above and beyond
those available in any single node; clients and servers may
need to negotiate the quality of service; system
administrators may wish to gather synthetic data regarding
resource utilization to identify bottlenecks. A data
intensive problem may generate a request to dynamically
assemble a cluster of workstations with a compound CPU rate,
memory, and secondary storage space determined by the
problem size. A system administrator may wish to determine
the overall secondary storage utilization in a virtual
Intranet.

Resource management in a distributed system can be
delegated to a subset of nodes providing site coordination,
negotiation, resource monitoring, and other services. For
example, the Open Data Network (ODN) model [13] is based
upon an hourglass architecture with four layers:
applications, middleware services, transport services, and
bearer services provided by LANs, wireless networks, ATM,
satellite networks and so on. The architecture is conceived
to support services ranging from teleconferencing to
financial services, from remote login to interactive
education. In turn, middleware services cover security,
name services, multi-site coordination, file systems and so
on, and use transport services for video, audio, text, fax,
and other types of data. The diversity of the networking
substrate, the heterogeneity and autonomy of the nodes, and
the variety of services provided by the system make all
aspects of resource management in this model rather
challenging and motivate the search for solutions that are
more scalable and able to accommodate rapidly changing
heterogeneous environments.

Distributed algorithms for resource management have
been known for some time. The flooding algorithm is used
by routers in the Internet; broadcasting by local queries,
known as “gossiping” [11], [15], has been used to maintain
consistency in replicated databases [3] and to gather
information about system failures [4].

Autonomous and mobile software agents are widely regarded
as necessary components of large-scale distributed systems.
Agents can facilitate access to existing services for thin
clients, support nomadic computing, perform functions
related to resource management, support negotiations among
several parties involved in a transaction, reconfigure
servers, and so on. For example, mobile agents that map
network topology were proposed in [14].

Autonomy implies that the agents are active objects with
their own thread of control and that they can exhibit
intelligent behavior. Mobility ensures that the agents can
operate in rapidly changing heterogeneous environments. Yet
ensuring code mobility in a heterogeneous environment, where
the architecture of the nodes differs and several operating
systems are installed, is a non-trivial endeavor.

The implicit assumption of an agent-based solution for
resource discovery in a wide-area system is the existence
of a framework for the interoperability of different agent
families, like the one proposed in [1]. Throughout this
paper we assume that a system like the one described in
Section 3.1 is installed in every node and that the system
has an agent factory, an object able to respond to external
requests and assemble agents based upon a description of an
agent provided by the entity that initiated the request.

In this paper we introduce an agent-based model for
resource discovery. Agents running at individual nodes learn
about the existence of each other using a mechanism called
distributed awareness. Each agent maintains information
about the other agents it has communicated with over a
period of time, and the agents periodically exchange this
information among themselves. Whenever an agent needs
detailed information about individual components of the
system, we use the information gathered by the distributed
awareness mechanism and then dynamically assemble agents
capable of reporting the state of remote resources and of
negotiating the use of these resources. The remote agent
creation and surgery techniques discussed in Section 3.3 are
general and allow us to drastically alter the behavior of an
agent. For example, we can add planes for resource
negotiations with clients and with the local resource
manager, planes to reconfigure a local server, and so on.

The  contributions  of  this  paper  are  an  algorithm  and  a 
model  for  the  distributed  awareness  and  a  framework  for 
dynamic  assembly  of  agents  capable  of  providing  detailed 
information  about  network  resources. 

The  rest  of  this  paper  is  structured  as  follows.  Section 
2  reviews  some  of  the  existing  algorithms  for  resource  dis¬ 
covery,  presents  their  basic  assumptions  and  relevant  per¬ 
formance  measures.  Then  it  presents  our  distributed  aware¬ 
ness  algorithm  and  two  models  for  its  behavior.  Section  3 
introduces  the  agent-based  resource  discovery  architecture 
and  describes  an  implementation  based  upon  Bond  [6],  a 
component-based  agent  framework. 

2.  Algorithms  and  Models  for  Distributed 
Awareness 

A first step in all applications that require some
knowledge about the other nodes of a network is for the
nodes to learn about the existence of each other. We call
this process “distributed awareness”, while other authors
[10] refer to it as resource discovery. We believe that in a
heterogeneous environment learning about the existence of
other nodes is only the first step in a complex process and
that resource discovery requires a set of progressively more
intricate interactions with the newly discovered object.

2.1.  Related  work 

We  review  briefly  some  of  the  algorithms  presented  in 
the  literature,  their  basic  assumptions,  and  the  proposed  per¬ 
formance  measures  to  evaluate  an  algorithm.  Virtually  all 
algorithms  model  the  distributed  system  as  a  directed  graph, 
in  which  each  machine  is  a  node  and  edges  represent  the  re¬ 
lation  “machine  A  knows  about  machine  B”.  The  network 
is  assumed  to  be  weakly  connected  and  communication  oc¬ 
curs  in  synchronous  parallel  rounds. 

One  performance  measure  is  the  running  time  of  the  al¬ 
gorithm,  namely  the  number  of  rounds  required  until  ev¬ 
ery  machine  learns  about  every  other  machine.  The  amount 
of  communication  required  by  the  algorithm  is  measured 
by:  (a)  the  pointer  communication  complexity  defined  as  the 
number  of  pointers  exchanged  during  the  course  of  the  al¬ 
gorithm,  and  (b)  the  connection  communication  complexity 
defined  by  the  total  number  of  connections  between  pairs  of 
entities. 

The flooding algorithm assumes that each node v only
communicates over edges connecting it with a set of initial
neighbors, $\Gamma(v)$. In every round node v contacts all
its initial neighbors, transmits to them its updates,
$\Gamma(v)_{updates}$, and then updates its own set of
neighbors by merging $\Gamma(v)$ with the set
$\{\Gamma(u)_{updates}\}$, with $u \in \Gamma(v)$. The number
of rounds required by the flooding algorithm is equal to the
diameter of the graph.

The swamping algorithm allows a machine to open connections
with all its current neighbors, not only with the set of
initial neighbors. The graph of the network known to one
machine converges to a complete graph in $O(\log n)$ steps,
but the communication complexity increases. Here n is the
number of nodes in the network.

In the random pointer jump algorithm each node v contacts a
random neighbor, $u \in \Gamma(v)$, which sends $\Gamma(u)$
to v, which in turn merges $\Gamma(v)$ with $\Gamma(u)$. A
version of the algorithm called the random pointer jump with
back edge requires u to send back to v a pointer to all its
neighbors. There are even strongly connected graphs that
require, with high probability, $\Omega(n)$ time to converge
to a complete graph in the random pointer jump algorithm.

The Name-Dropper algorithm is proposed in [10]. During each
round each machine v transmits $\Gamma(v)$ to one randomly
chosen neighbor. A machine u that receives $\Gamma(v)$
merges $\Gamma(v)$ with $\Gamma(u)$. In this algorithm,
after $O(\log^2 n)$ rounds the graph evolves into a complete
graph with probability greater than $1 - 1/n^{O(1)}$.
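
The Name-Dropper round described above can be sketched as a
small simulation. This is a minimal sketch, not the
implementation from [10]: the function name name_dropper, the
max_rounds cap, and the seed parameter are illustrative
choices, and the sketch assumes that a receiver also records
the sender's own address, so that nodes without outgoing
pointers can still be discovered in a weakly connected graph.

```python
import random


def name_dropper(initial, max_rounds=10_000, seed=0):
    """Simulate Name-Dropper on a directed "knows about" graph.

    initial maps each node to the set of nodes it knows at the start.
    Returns (rounds, known): the number of synchronous rounds executed
    and the final knowledge of every node.
    """
    rng = random.Random(seed)
    nodes = sorted(initial)
    everyone = set(nodes)
    # Copy the input so the caller's sets stay untouched; a node never
    # lists itself as a neighbor.
    known = {v: set(initial[v]) - {v} for v in nodes}

    def complete():
        return all(known[v] == everyone - {v} for v in nodes)

    rounds = 0
    while not complete() and rounds < max_rounds:
        # Synchronous round: every message is built from the state at the
        # start of the round and merged only after all messages are sent.
        inbox = {v: set() for v in nodes}
        for v in nodes:
            if known[v]:
                u = rng.choice(sorted(known[v]))  # one random neighbor
                # Assumption: the receiver also learns the sender itself.
                inbox[u] |= known[v] | {v}
        for u in nodes:
            known[u] |= inbox[u] - {u}
        rounds += 1
    return rounds, known
```

Starting from a directed ring, for example
`name_dropper({i: {(i + 1) % 16} for i in range(16)})`, the
returned knowledge sets can be compared with the complete
graph to observe how many rounds a particular run needed.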
2.2. Distributed Awareness; Algorithm and Models

2.2.1  A  Distributed  Awareness  Algorithm 

Distributed awareness is a mechanism for the nodes of a
message-passing distributed system to learn about the
existence of each other. Each node maintains an awareness
table and exchanges the information in this table with other
nodes. An entry in the awareness table contains: (1) node
location, the IP address of a node, (2) lastHeardFrom, the
time when we last heard from the node, and (3) lastSync, the
time when the awareness information was last sent to the
node. The awareness information is piggybacked onto regular
messages exchanged between two nodes.

Incoming and outgoing message handling and table merging
are discussed now. The algorithm to add new items or update
existing ones is:

for every incoming message
    find sender, S
    if the local awareness table has an item I with
       the same node location as S
        set lastHeardFrom of I to the current time
    else
        add a new item initialized with S and
        lastHeardFrom set to the current time
    if the incoming message has piggybacked awareness
       information
        execute the table merging algorithm

The table merging algorithm is:

for each awareness item I of the piggybacked
    awareness table
    if the local table has an item I_local with the
       same node location as I
        set lastHeardFrom of I_local to the more recent
        of the time stamps of I_local and I
    else
        add I to the local table with lastSync set to
        zero

The outgoing message handling algorithm appends the
local awareness table to the outgoing message:

for an outgoing message M_outgoing destined to a node N
    look up the item I with node location N in the
    local table
    if lastSync of I has reached a specified age
        add the local table to M_outgoing
        set lastSync of I to the current time
    send out M_outgoing

Notice that lastSync is checked to control the interval
between sending awareness information and that the awareness
table is periodically purged based upon the lastHeardFrom
field.
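
The three procedures above can be sketched together as one
small class. This is a minimal sketch under stated
assumptions, not the Bond implementation: the names
AwarenessItem and AwarenessTable, the parameters sync_age and
max_silence, and the injectable clock are hypothetical, and
the choice to send nothing when the destination is not yet in
the local table is an assumption the pseudocode leaves open.

```python
import time
from dataclasses import dataclass


@dataclass
class AwarenessItem:
    location: str            # node location, e.g. an IP address
    last_heard_from: float   # time we last heard from (or about) the node
    last_sync: float = 0.0   # time we last sent our table to the node


class AwarenessTable:
    def __init__(self, sync_age=60.0, max_silence=3600.0, clock=time.time):
        self.items = {}                 # location -> AwarenessItem
        self.sync_age = sync_age        # minimum interval between syncs
        self.max_silence = max_silence  # purge threshold (assumed policy)
        self.clock = clock

    def on_incoming(self, sender, piggybacked=None):
        """Handle an incoming message from sender, with optional table."""
        now = self.clock()
        item = self.items.get(sender)
        if item is not None:
            item.last_heard_from = now
        else:
            self.items[sender] = AwarenessItem(sender, now)
        if piggybacked:
            self.merge(piggybacked)

    def merge(self, remote_items):
        """Table merging: keep the more recent lastHeardFrom per node."""
        for remote in remote_items:
            local = self.items.get(remote.location)
            if local is not None:
                local.last_heard_from = max(local.last_heard_from,
                                            remote.last_heard_from)
            else:
                self.items[remote.location] = AwarenessItem(
                    remote.location, remote.last_heard_from, 0.0)

    def on_outgoing(self, destination):
        """Return a table snapshot to piggyback, or None if not due."""
        now = self.clock()
        item = self.items.get(destination)
        # Assumption: an unknown destination receives no table.
        if item is None or now - item.last_sync < self.sync_age:
            return None
        item.last_sync = now
        return [AwarenessItem(i.location, i.last_heard_from)
                for i in self.items.values()]

    def purge(self):
        """Drop nodes that have been silent longer than max_silence."""
        now = self.clock()
        self.items = {loc: i for loc, i in self.items.items()
                      if now - i.last_heard_from <= self.max_silence}
```

Passing a fake clock, as in `AwarenessTable(clock=lambda: 0.0)`,
makes the age checks easy to exercise without waiting.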

2.2.2  Deterministic  and  Non-deterministic  Models 

Modeling and analysis of the distributed awareness
algorithm is rather difficult. The problem is unstructured:
in the general case we know neither the network topology nor
the communication patterns among nodes, so it is rather
difficult to make simplifying assumptions that lead to a
tractable analysis. Yet we need to get a rough idea of the
overhead incurred by this method and of the asymptotic
properties of the algorithm. Intuitively we expect that
after some time all agents will learn about the existence of
all other agents.

To model the distributed awareness we propose to use
models similar to the ones for the spread of a contagious
disease. An epidemic develops in a population of fixed size
consisting of two groups: the infected individuals and the
uninfected ones. The progress of the epidemic is determined
by the interactions between these two groups.

We first introduce a deterministic model. Given a group
of n nodes, this simple model is based upon the assumption
that the rate of change of an agent's awareness list is
proportional to the size of the group the agent is already
aware of, y, and also to the size of the group the agent is
unaware of, $n - y$. If k is a constant, we can express this
relation as follows:

\[ y'(t) = k \, y(t) \, (n - y(t)) \]

The solution of this differential equation with the initial
condition $y(0) = 1$ is:

\[ y(t) = \frac{n}{1 + (n - 1) e^{-knt}} \]

This function is plotted in Figure 1 and shows that after
time $\tau$ a node becomes aware of all the other nodes in
the network. The parameter k as well as the value $\tau$ can
be determined through simulation.
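
The deterministic model can be checked numerically. The
sketch below is illustrative only: it assumes the initial
value y(0) = 1 implied by the factor (n - 1) in the solution,
and the function names aware_count, euler_aware_count and
time_to_fraction, as well as the idea of reading the time
needed to reach a chosen fraction of n as a stand-in for the
time after which a node knows practically all others, are
assumptions introduced here.

```python
import math


def aware_count(t, n, k):
    """Closed-form logistic solution y(t) with y(0) = 1."""
    return n / (1.0 + (n - 1) * math.exp(-k * n * t))


def euler_aware_count(t_end, n, k, steps=100_000):
    """Integrate y' = k*y*(n - y) from y(0) = 1 with explicit Euler steps."""
    y = 1.0
    dt = t_end / steps
    for _ in range(steps):
        y += dt * k * y * (n - y)
    return y


def time_to_fraction(fraction, n, k):
    """Time at which y(t) = fraction * n, for 1/n < fraction < 1.

    Obtained by solving n / (1 + (n - 1) e^{-knt}) = fraction * n for t.
    """
    return math.log((n - 1) * fraction / (1.0 - fraction)) / (k * n)
```

Because the logistic curve only approaches n asymptotically,
a threshold such as 99% of n has to be chosen before a finite
time can be read off; k itself would still have to be
estimated, for example through simulation.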

Call $\eta$ the ratio of the awareness information exchanges
to the total number of instances an agent communicates
with other agents. A typical value for this parameter is
$\eta = 0.001$. If the amount of awareness information is
only a fraction b, say b = 0.1, of the payload carried
during communication between two agents, it follows that the
additional load due to the distributed awareness is only a
small fraction of the total network traffic, in our example
$\eta \times b = 0.0001 = 0.01\%$.

This  deterministic  model  allows  only  a  qualitative  analy¬ 
sis.  Rather  than  the  smooth  transition  from  0  to  n  we  should 
expect  a  series  of  transitions  each  one  corresponding  to  a 
batch  of  newly  discovered  agents.  Yet  this  simple  model 
provides  some  insight  into  the  overhead  incurred  during  the 
learning  phase  of  the  awareness  mechanism  we  propose. 
Figure 1. The number of agents known to a given agent as a
function of time, using a deterministic distributed
awareness model. After time $\tau$, each agent becomes aware
of all the other agents in the network.


A non-deterministic model is sketched below. New
acquaintances occur in batches at time intervals determined
by the overall rate of information exchange among nodes and
by $\eta$. Call p the probability of a contact between two
agents such that, as a result of the contact, the awareness
lists are modified, and let $q = 1 - p$. Assume that the
contacts between agents are stochastically independent and
observe that the probability that among the i entries in the
list supplied to an agent, $k \le i$ entries are not already
in its list is

\[ \binom{i}{k} \, p^{k} \, q^{i-k} \]

Call Y(s) the random variable denoting the number of
entries in the list of the “typical” agent at discrete time
s = 1, 2, .... Then

P{Y(s  +  1)  =  j\Y(s)  =i)=  (*)  x  tf-i  x  qi 

if  i  >  j  and  zero  otherwise. 

Given Y(s), the probability distribution of Y(s + 1) is
independent of the values assumed by the random variables
Y(r), r < s. Therefore $(Y(s))_{s \ge 0}$ is a Markov chain
with states 0, 1, ..., n, whose transition matrix is formed
by the conditional probabilities P{Y(s + 1) = j | Y(s) = i}.

Once a large system is operational we can attempt to
determine the parameters of the model, including the
transition probabilities, and then validate the model. The
large number of parameters makes this model very cumbersome
for the analysis of a realistic system with a large number
of nodes. The model is useful for theoretical studies
assuming different communication patterns, but this is
beyond the scope of this paper.

3.  Monitoring  Agents  and  Resource  Discovery 

Information  about  the  resources  and  the  state  of  the 
nodes  of  a  wide  area  distributed  system  is  sometimes 
needed  to  coordinate  the  activity  of  a  group  of  nodes,  to 
provide  synthetic  information  about  resource  utilization,  or 
for  other  needs.  A  common  approach  taken  by  commercial 
as  well  as  research  systems  is  to  install  on  each  node  a  mon¬ 
itor  to  gather  local  resource  information.  The  local  monitors 
may  update  periodically  a  centrally  stored  database  or  pro¬ 
vide  the  information  on  demand.  Sometimes  the  informa¬ 
tion  may  be  stored  in  servers  hierarchically  arranged. 

Several  metacomputing  projects  [9],  [7]  rely  on  a  group 
of  central  entities  to  maintain  the  resource  information  re¬ 
ported  by  local  entities.  Globus  [9]  provides  a  Metacomput¬ 
ing  Directory  Service  where  network  resource  information 
is  stored  in  a  tree-like  structure  and  it  is  accessible  using  the 
Lightweight  Directory  Access  Protocol  [16].  Local  moni¬ 
tors  residing  on  each  node  report  the  structure  and  state  of 
resources.  Monitors  have  to  be  installed  and  configured  for 
each site. Legion [7] uses collections as repositories for
information describing the state of the resources comprising
the system. The collection is a database of static
information reported by remote monitors. Resource management
software provided by several companies, including Tivoli
[2], follows the same paradigm.

The  information  provided  by  a  local  monitor  is  deter¬ 
mined  at  the  time  the  monitoring  program  is  installed.  To 
provide  additional  information  the  program  must  be  modi¬ 
fied  and  reinstalled,  and  also  it  must  be  non-intrusive.  Often 
the  information  obtained  from  static  databases  is  obsolete. 
These  considerations  justify  the  need  to  investigate  alterna¬ 
tive  means  for  gathering  resource  information. 

Using  software  agents  for  resource  discovery  and  mon¬ 
itoring  has  several  advantages  over  the  traditional  ap¬ 
proaches  outlined  above.  Monitoring  agents  have  an  au¬ 
tonomous  behavior  and  evolve  based  upon  the  characteris¬ 
tics  of  the  local  system  and  the  requirements.  Agents  can 
engage  in  a  gradual  discovery  process  and  respond  to  a 
changing  set  of  requirements.  They  are  able  to  adapt  to  the 
architecture  and  the  operating  environment  of  local  nodes. 
An  agent  may  decide  its  behavior  based  upon  the  results  of 
an  inference  process  thus  the  tasks  assigned  can  be  rather 
complex. 

Now we describe an agent-based, distributed resource
discovery  architecture  where  agents  are  created  at  remote 
locations  and  modified  as  needed,  to  gather  the  information 
for  resource  management. 

3.1.  Bond;  a  Distributed  Object  System 

Bond is a Java-based distributed object system and agent
framework, with an emphasis on flexibility and performance.
It is composed of (a) a core containing an object model and
message-oriented middleware, (b) a service layer containing
distributed services like directory and persistent storage
services, and (c) the agent framework providing basic tools
for creating autonomous agents and a set of reusable
components, called strategies, from which developers
assemble agents with little or no programming.

Bond Core. At the heart of the Bond system there is a
Java Bean-compatible component architecture. Bond objects
extend Java Beans by allowing users to attach new properties
to the object during runtime, and offer a uniform API for
accessing regular fields, dynamic properties and Java Bean
style setField/getField-defined virtual fields. This gives
programmers flexibility similar to that of languages such as
Lisp or Scheme, while maintaining the familiar Java
programming syntax.
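
The idea of one access path for regular fields, dynamic
properties and accessor-defined virtual fields can be
illustrated as follows. This is a hedged sketch, not the Bond
API (Bond itself is written in Java): the class names
DynamicObject and Node, the get/set method names, and the
set_<name>/get_<name> accessor convention are all
hypothetical stand-ins.

```python
class DynamicObject:
    """Toy object with a uniform get/set API over three kinds of fields."""

    def __init__(self):
        self._dynamic = {}  # properties attached at run time

    def set(self, name, value):
        setter = getattr(self, "set_" + name, None)
        if callable(setter):
            setter(value)                # virtual field defined by a setter
        elif name in vars(self):
            setattr(self, name, value)   # regular instance field
        else:
            self._dynamic[name] = value  # new dynamic property

    def get(self, name):
        getter = getattr(self, "get_" + name, None)
        if callable(getter):
            return getter()              # virtual field defined by a getter
        if name in vars(self):
            return vars(self)[name]      # regular instance field
        if name in self._dynamic:
            return self._dynamic[name]   # dynamic property
        raise KeyError(name)


class Node(DynamicObject):
    def __init__(self, host):
        super().__init__()
        self.host = host                 # regular field
        self._samples = [0.5, 0.7]

    def get_load(self):                  # virtual field, computed on access
        return sum(self._samples) / len(self._samples)
```

A caller can then write `node.set("owner", "alice")` and
`node.get("load")` without knowing which kind of field is
behind each name.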

Bond objects are network objects; they can be both
senders and receivers of messages. No post-processing of
the object code, such as the stub generation of RMI or
CORBA, is needed. Bond uses message passing, while RMI or
CORBA-based component architectures use remote method
invocation.

The system is largely independent of the message
transport mechanism, thus several communication engines
can be used interchangeably. We currently provide
TCP-based, UDP-based, Infospheres-based, and, separately, a
multicast engine. Other communication engines will be
implemented as needed. The API of the communication engine
allows Bond objects to use any communication engine without
changing or recompiling code. On the other hand, the
properties of the communication engine are reflected in
applications as a whole. For example, the UDP-based engine
offers higher performance but does not guarantee reliable
delivery.

All Bond objects communicate using an agent communication
language, KQML [8]. Recently XML-based inter-agent
communication was provided as an alternative to KQML. Bond
defines the concept of subprotocols: highly specialized,
closed sets of commands. Subprotocols generally contain the
messages needed to perform a specific task. Examples of
generic Bond subprotocols are the property access
subprotocol, the agent control subprotocol and the security
subprotocol.


Subprotocols group functionality that, in a remote method
invocation system, would be grouped in an interface.
However, the greater flexibility of the messaging system
allows for several new techniques which are difficult to
implement in the remote method call case:

•  The  subprotocols  implemented  by  an  object  are  prop¬ 
erties  of  the  object,  thus  two  objects  can  use  the  prop¬ 
erty  access  subprotocol,  which  is  implemented  by  ev¬ 
ery  Bond  object,  to  find  the  common  set  of  subproto¬ 
cols  between  them. 

•  An  object  is  able  to  control  the  internal  path  of  a 
message  delivery  and  to  delegate  the  processing  of 
the  message  to  subcomponents  called  regular  probes. 
Regular  probes  can  be  attached  dynamically  to  an  ob¬ 
ject  as  needed. 

•  Messages can be intercepted before they are delivered
to the object, thus providing a convenient way to implement
firewall, accounting, logging, monitoring, filtering or
preprocessing functions. These operations are performed by
subcomponents called preemptive probes.

•  Subprotocols,  like  interfaces,  group  some  functionality 
of  the  object,  which  may  or  may  not  be  used  during  its 
lifetime.  A  subcomponent  called  autoprobe  allows  the 
object  to  instantiate  a  new  probe,  to  handle  an  incom¬ 
ing  message  which  could  not  be  understood  by  existing 
probes. 

•  Objects  can  be  addressed  by  their  unique  identifier,  or 
by  their  alias.  Aliases  specify  the  services  provided  by 
the  object  or  its  probes.  An  object  can  have  multiple 
aliases  and  multiple  objects  can  be  registered  under  the 
same  alias.  The  latter  enables  the  architecture  to  sup¬ 
port  load  balancing  services. 
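
The message path described in the list above can be sketched
as a small dispatcher. This is an illustration only and does
not reproduce Bond's classes: the name BondLikeObject, the
dictionary-based message format with a "subprotocol" key, the
probe_factories table used to model the autoprobe, and the
"not-understood" reply are all assumptions made for the
example.

```python
class BondLikeObject:
    """Toy object that routes messages through probes."""

    def __init__(self, probe_factories=None):
        self.preemptive = []   # callables: message -> message, or None to drop
        self.regular = {}      # subprotocol name -> handler callable
        # Factories the autoprobe may use to create a missing regular probe.
        self.probe_factories = dict(probe_factories or {})

    def subprotocols(self):
        """Subprotocols this object can handle, now or via the autoprobe."""
        return set(self.regular) | set(self.probe_factories)

    def deliver(self, message):
        # Preemptive probes see the message before the object does.
        for probe in self.preemptive:
            message = probe(message)
            if message is None:
                return None    # intercepted, e.g. by a firewall probe
        name = message["subprotocol"]
        handler = self.regular.get(name)
        if handler is None:
            # Autoprobe: instantiate a regular probe on first use.
            factory = self.probe_factories.get(name)
            if factory is None:
                return {"status": "not-understood", "subprotocol": name}
            handler = self.regular[name] = factory()
        return handler(message)


def common_subprotocols(a, b):
    """Set of subprotocols two objects share."""
    return a.subprotocols() & b.subprotocols()
```

Attaching a logging or filtering function to `preemptive`
changes what the object receives without touching any regular
probe, which is the separation the list describes.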

These techniques can be implemented through different
means in languages which treat methods as messages, e.g.
Smalltalk. In Java and C++ they can be implemented at
compile time, not at runtime, e.g. using the delegation
design pattern. Techniques from the recent CORBA
specifications, e.g. the simultaneous use of DII, POA, the
trading service and others, also make it possible to
implement similar functionality, but with a larger overhead
and significantly more complex code.

Bond  Services.  Bond  provides  a  number  of  services 
commonly  found  in  distributed  object  systems,  like  direc¬ 
tory,  persistent  storage,  monitoring  and  security.  Event,  no¬ 
tification,  and  messaging  services,  which  provide  message 
passing  services  in  remote  method  invocation  based  systems 
are  not  needed  in  Bond,  due  to  the  message-oriented  archi¬ 
tecture  of  the  system. 

Some Bond services behave differently from their
counterparts in other middleware systems, like CORBA.
For example, Bond never requires explicit registration of a
new object with a service. Finding out the properties of a
remote object, i.e. the set of subprotocols implemented by
the object, is achieved by direct negotiation among the
objects. The directory service in Bond combines the
functionality of the naming and trading services of other
systems and is implemented in a distributed fashion. Objects
are located by a search process which propagates from local
directory to local directory. The directories are linked
into a virtual network by a transparent distributed
awareness mechanism, which transfers directory information
by piggybacking on messages as discussed in Section 2.
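
A search that propagates from local directory to local
directory can be sketched as a bounded breadth-first
traversal. The source does not specify the propagation
order, so this is only one possible reading: the Directory
class, the search function, the alias-to-identifiers table
and the max_hops bound are hypothetical and are not taken
from Bond.

```python
from collections import deque


class Directory:
    def __init__(self, name):
        self.name = name
        self.entries = {}       # alias -> set of object identifiers
        self.neighbors = set()  # directories learned via distributed awareness


def search(directories, start, alias, max_hops=8):
    """Collect objects registered under alias, starting at the local directory.

    directories maps a directory name to its Directory.
    Returns (found, visited): matching identifiers and the visit order.
    """
    found, visited = set(), []
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        name, hops = queue.popleft()
        visited.append(name)
        found |= directories[name].entries.get(alias, set())
        if hops < max_hops:
            # Forward the query to directories this one is aware of.
            for nxt in sorted(directories[name].neighbors):
                if nxt in directories and nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, hops + 1))
    return found, visited
```

The hop bound trades completeness for latency: a small value
answers quickly from nearby directories, while a large value
approaches an exhaustive search of the virtual network.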

Compared  with  the  naming  service  implementations  in 
systems  like  CORBA  or  RMI,  which  are  based  on  the  ex¬ 
istence  of  a  name  server,  this  approach  has  the  advantage 
that  there  is  no  single  point  of  failure,  and  the  distributed 
awareness  mechanism  reconstitutes  the  network  of  directo¬ 
ries  even  after  catastrophic  failures.  However,  a  distributed 
search  can  be  slower  than  lookup  on  a  server,  especially  for 
large networks. For these cases, Bond objects can be
registered with external directories, either with a CORBA naming
service through a gateway object, or with external directory
services based on LDAP.
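The directory search described above can be sketched as a bounded breadth-first search over linked local directories. This is only an illustration under stated assumptions: the class LocalDirectory, the function find_object and the max_hops limit are hypothetical names introduced here, not part of the Bond API, and the links stand in for what the distributed awareness mechanism would provide.

```python
from collections import deque


class LocalDirectory:
    """Hypothetical stand-in for a Bond local directory (not the Bond API)."""

    def __init__(self, name):
        self.name = name
        self.objects = {}     # object name -> address, known locally
        self.neighbors = []   # directories learned through distributed awareness

    def register(self, obj_name, address):
        self.objects[obj_name] = address

    def link(self, other):
        # Links are symmetric in this sketch.
        if other not in self.neighbors:
            self.neighbors.append(other)
        if self not in other.neighbors:
            other.neighbors.append(self)


def find_object(start, obj_name, max_hops=3):
    """Search from directory to directory; return (address, hops) or None."""
    visited = {start.name}
    queue = deque([(start, 0)])
    while queue:
        directory, hops = queue.popleft()
        if obj_name in directory.objects:
            return directory.objects[obj_name], hops
        if hops == max_hops:
            continue  # do not propagate the search any further
        for nxt in directory.neighbors:
            if nxt.name not in visited:
                visited.add(nxt.name)
                queue.append((nxt, hops + 1))
    return None


a, b, c = LocalDirectory("a"), LocalDirectory("b"), LocalDirectory("c")
a.link(b)
b.link(c)
c.register("ResourceAgent", "c:2000")
```

The hop limit makes the trade-off mentioned in the text visible: a search has no single point of failure, but its cost grows with the number of directories it has to visit.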

3.2.  Bond  Agents 

The  Bond  agent  framework  is  an  application  of  the  facil¬ 
ities  provided  by  the  Bond  core  layer  to  implement  collab¬ 
orative  network  agents.  Agents  are  assembled  dynamically 
from  components  in  a  structure  described  by  a  multi-plane 
state  machine  [5].  This  structure  is  described  by  a  spe¬ 
cialized  language  called  blueprint.  Bond  also  supports 
agent  description  in  XML.  The  components  (strategies)  are 
loaded  locally  or  remotely,  or  can  be  specified  in  inter¬ 
pretive  programming  languages  embedded  in  the  blueprint 
script.  The  state  information  and  knowledge  base  of  the 
agents  are  collected  in  a  single  object  called  model  of  the 
world  which  allows  for  easy  checkpointing  and  migration  of 
agents.  The  multiplane  state  machine  describing  the  behav¬ 
ior  of  agents  can  be  modified  dynamically  by  agent  surgery, 
which  will  be  discussed  shortly. 

The  behavior  of  the  agent  is  described  by  the  actions 
the  agent  is  performing.  The  actions  are  performed  by 
the  strategies  either  as  reactions  to  external  events,  or  au¬ 
tonomously  in  order  to  pursue  the  agenda  of  the  agent.  The 
current state of the multiplane state machine (described by
a state vector) specifies the strategies active at a certain
moment. The multiple planes are a way of expressing
parallelism in Bond agents. A good technique is to use them
to express the various facets of the agent's behavior:
sensing, reasoning, communication/negotiation, acting upon the
environment and so on. The transitions in the agent
modify the behavior of the agent by changing the
current set of active strategies. The transitions can be triggered
by internal events or by external messages; these external
messages form the control subprotocol of the agent.
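A minimal sketch of a multiplane state machine may help fix the terminology. It assumes a plain dictionary-based representation; the classes Plane and MultiplaneAgent are hypothetical and do not reproduce Bond's implementation. The plane, state and strategy names are borrowed from the monitoring example of Section 3.3, except EmbeddedJPython, which is a placeholder label for the embedded strategy.

```python
class Plane:
    """One plane: a finite state machine whose states name strategies."""

    def __init__(self, name, initial):
        self.name = name
        self.current = initial
        self.strategies = {}    # state -> strategy name
        self.transitions = {}   # (state, event) -> next state

    def add_state(self, state, strategy):
        self.strategies[state] = strategy

    def add_transition(self, src, dst, event):
        self.transitions[(src, event)] = dst

    def fire(self, event):
        # Events with no matching transition in the current state are ignored.
        nxt = self.transitions.get((self.current, event))
        if nxt is not None:
            self.current = nxt
        return self.current


class MultiplaneAgent:
    def __init__(self, planes):
        self.planes = {p.name: p for p in planes}

    def state_vector(self):
        # One current state per plane.
        return {n: p.current for n, p in self.planes.items()}

    def active_strategies(self):
        # The state vector determines which strategies are active.
        return {n: p.strategies[p.current] for n, p in self.planes.items()}

    def fire(self, plane, event):
        return self.planes[plane].fire(event)


primary = Plane("PrimaryStorage", "Init")
primary.add_state("Init", "InitCheck")
primary.add_state("MemoryCheck", "EmbeddedJPython")
primary.add_transition("Init", "MemoryCheck", "gotoCheck")

secondary = Plane("SecondaryStorage", "Init")
secondary.add_state("Init", "InitSS")
secondary.add_state("StorageCheck", "MeasureSS")
secondary.add_transition("Init", "StorageCheck", "gotoCheck")

agent = MultiplaneAgent([primary, secondary])
agent.fire("PrimaryStorage", "gotoCheck")
```

A transition in one plane changes only that plane's entry in the state vector, which is how the planes express parallel facets of behavior.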

Strategies  are  reusable  by  having  interface  requirements. 
The  Bond  agent  framework  provides  a  strategy  database, 
for  the  most  commonly  used  tasks,  like  starting  and  con¬ 
trolling  external  agents  or  legacy  applications.  A  number  of 
base  strategies  for  common  tasks  like  dialog  boxes  or  mes¬ 
sage  handlers  are  also  provided,  which  can  be  sub-classed 
by developers to implement specific functionality. External
algorithms, especially if written in Java, are usually easily
portable to the strategy interface.

3.3.  Remote  Creation  and  Surgery  of  Monitoring 
Agents 

In  this  section,  we  discuss  the  remote  creation  of  an  agent 
and  its  surgery.  To  illustrate  the  concepts  outlined  in  Section 
3.2, we present the creation and modification of a monitoring
agent. Several entities are involved in this process: a
beneficiary agent at the site where the resource information
is needed, an agent factory at the target site, and possibly
a blueprint repository. The target site is identified by the
distributed awareness mechanism or by a name or directory
service. To install a monitoring agent on the target site, the
beneficiary agent needs to obtain a blueprint of a monitoring agent.
The  blueprint  can  be  retrieved  from  the  blueprint  reposi¬ 
tory  or  created  dynamically  by  an  inferencing  agent  given  a 
set  of  rules  and  facts.  After  that,  a  message  containing  the 
blueprint  or  the  location  of  the  repository  and  the  blueprint 
name  is  sent  to  the  agent  factory.  Figure  2  illustrates  this 
process.  A  Bond  Resident  is  a  container  object  including 
directory,  communicator,  and  all  other  objects.  In  this  ex¬ 
ample  the  message  sent  by  the  beneficiary  agent  contains 
the  blueprint: 

(achieve :content assemble-agent
         :blueprint-program [agent blueprint])

The  beneficiary  agent  in  this  example  decides  to  create 
a  single  plane  monitoring  agent  with  the  blueprint  shown 
in  Figure  3.  Figure  4  shows  the  monitoring  agent  with  one 
plane  designed  to  gather  information  about  the  primary  stor¬ 
age,  e.g.  the  amount  of  physical  memory  available  in  the 
node,  the  amount  of  used  and  free  storage,  a  list  of  the  top 
users  of  memory,  and  so  on.  Notice  that  each  plane  de¬ 
scribes  a  state  machine. 

Figure 2. The communication between the Beneficiary
Agent, the Agent Factory and the Blueprint Repository.
Messages instructing the agent factory to create a
monitoring agent (solid line) and to perform surgery
(dotted line) are shown.

The agent factory receives the message, interprets the
blueprint, and creates a monitoring agent with one plane
called PrimaryStorage using one strategy included in
the blueprint as a JPython program [12], associated with
the MemoryCheck state. The complete JPython strategy is
shown in the Appendix. After creating the agent, the agent
factory sends back an acknowledgment to the beneficiary
agent.

create agent MonitoringAgent
  plane PrimaryStorage
    add state Init with strategy InitCheck;
    add state MemoryCheck with strategy language
        python embedded {:

def getcmdresults(cmd):
    """Run a command and return its output
    as a string and exit value"""
    ...

def vmstat():
    """Return the statistics
    from vmstat output
    in form of a hashtable"""
    [list, exitcode] = getcmdresults('vmstat 1 2')
    ...

def save(map, prefix = ''):
    """Save a hashtable into model"""
    ...

save(vmstat(), 'discover.')
self.fsm.transition("gotoReport");
    :}
    add state MemoryReport with strategy ReportPS;
    add state Done with strategy End;
    internal transitions {
      from InitCheck to MemoryCheck on gotoCheck;
      from MemoryCheck to MemoryReport on gotoReport;
      from MemoryReport to Done on gotoDone;
    }
    model {
      BeneficiaryAddress =
        "ResourceAgent@peter.cs.purdue.edu:2000"
    }
  end plane;
end create.

Figure 3. The blueprint of a monitoring agent
designed to gather information about available
physical memory, the amount of used
and free storage, and a list of top memory
users.

Figure 4. The monitoring agent built using the
blueprint in Figure 3. The strategies associated
with every state are shown in parentheses.

Once started, the agent performs a transition to the
MemoryCheck state. The JPython strategy identifies the
operating system running on that node and invokes the
system commands, e.g. vmstat in Unix, necessary to gather the
information about the primary storage. If successful, the
state machine performs a transition to the MemoryReport
state with strategy ReportPS, sends back the
information to the beneficiary agent named in the
BeneficiaryAddress of the blueprint, and finishes its
execution by a transition to the Done state with the End strategy.

The primary storage map changes in real time, thus it
might be desirable to have an agent capable of reporting the
information periodically. In addition, it may be necessary
to gather information about secondary storage, e.g. the total
amount of disk space available, the amount in use, the free
disk space, the number of file systems, etc.

To obtain the periodic memory report and the secondary
storage information, the agent can be modified through
surgery as shown in Figure 5. In our example we (a) add
another plane, called SecondaryStorage, to report the
amount of free secondary storage space, and (b) modify the
memory plane by adding a transition from the MemoryReport
state to the MemoryCheck state while deleting the Done state
and the gotoEnd transition. As a result, the agent periodically
reports the state of the primary storage. The reporting
interval is specified in the blueprint as Interval, in this case
5000 msec.

modify agent Probing
  plane SecondaryStorage
    add state Init with strategy InitSS;
    add state StorageCheck with strategy MeasureSS;
    add state StorageReport with strategy ReportSS;
    internal transitions {
      from InitSS to StorageCheck on gotoCheck;
      from StorageCheck to StorageReport on gotoReport;
      from StorageReport to StorageCheck on gotoCheck;
    }
  end plane;
  plane PrimaryStorage
    delete state Done;
    internal transitions {
      delete from MemoryReport to Done on gotoEnd;
      from MemoryReport to MemoryCheck on gotoCheck;
    }
    model {
      Interval = 5000;
    }
  end plane;

Figure 5. The agent surgery script. A second
plane, SecondaryStorage, is added and the state
machine of the first plane, PrimaryStorage,
is modified.

Figure 6. The agent after the surgery.

To perform the surgery, we send the agent factory at the
target site the following message:

(achieve :content modify-agent :bondID [agent ID]
         :blueprint-program [agent surgery script])

The  message  contains  the  unique  Bond  ID  of  the  agent. 
This  allows  the  agent  factory  to  identify  the  target  of  the 
surgery  request.  Figure  6  shows  the  monitoring  agent  after 
the  surgery  of  Figure  5.  Agent  surgery  involves  the  modifi¬ 
cation  of  the  data  structure  used  to  control  the  scheduling  of 
various  strategies  in  the  planes  of  the  agent.  The  surgery  can 
be  performed  while  the  agent  is  running  and  the  blueprint  of 
the  modified  agent  can  be  generated. 
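The effect of the surgery script on the data structure that drives the planes can be sketched as follows. This is an illustrative model only: the dictionary layout and the function apply_surgery are assumptions made here, the label EmbeddedJPython is a placeholder for the embedded strategy, and the sketch works on a copy of the agent description, whereas Bond performs the surgery on the running agent.

```python
import copy


def apply_surgery(agent, script):
    """Return a modified copy of `agent`; the input is left untouched."""
    result = copy.deepcopy(agent)
    for plane_name, ops in script.items():
        # A plane that does not exist yet is created empty.
        plane = result.setdefault(
            plane_name, {"states": {}, "transitions": {}, "model": {}})
        for state in ops.get("delete_states", []):
            plane["states"].pop(state, None)
            # Drop every transition that starts or ends in the deleted state.
            plane["transitions"] = {
                (src, ev): dst
                for (src, ev), dst in plane["transitions"].items()
                if src != state and dst != state
            }
        plane["states"].update(ops.get("add_states", {}))
        for src, dst, ev in ops.get("add_transitions", []):
            plane["transitions"][(src, ev)] = dst
        plane["model"].update(ops.get("model", {}))
    return result


agent = {
    "PrimaryStorage": {
        "states": {"Init": "InitCheck", "MemoryCheck": "EmbeddedJPython",
                   "MemoryReport": "ReportPS", "Done": "End"},
        "transitions": {("Init", "gotoCheck"): "MemoryCheck",
                        ("MemoryCheck", "gotoReport"): "MemoryReport",
                        ("MemoryReport", "gotoDone"): "Done"},
        "model": {},
    }
}

surgery = {
    "SecondaryStorage": {
        "add_states": {"Init": "InitSS", "StorageCheck": "MeasureSS",
                       "StorageReport": "ReportSS"},
        "add_transitions": [("Init", "StorageCheck", "gotoCheck"),
                            ("StorageCheck", "StorageReport", "gotoReport"),
                            ("StorageReport", "StorageCheck", "gotoCheck")],
    },
    "PrimaryStorage": {
        "delete_states": ["Done"],
        "add_transitions": [("MemoryReport", "MemoryCheck", "gotoCheck")],
        "model": {"Interval": 5000},
    },
}

modified = apply_surgery(agent, surgery)
```

After the surgery the PrimaryStorage plane loops between MemoryCheck and MemoryReport, which is what produces the periodic report.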

4.  Conclusions 

Information  about  the  topology,  resources  and  the  state 
of  the  nodes  of  a  wide  area  distributed  system  is  sometimes 
needed  to  coordinate  the  activity  of  a  group  of  nodes,  to  pro¬ 
vide  synthetic  information  about  resource  utilization,  or  for 
other  needs.  A  common  approach  taken  by  commercial  as 
well  as  research  systems  is  to  install  on  each  node  a  moni¬ 
tor  to  gather  local  resource  information.  The  local  monitors 
may  update  periodically  a  centrally  stored  database  or  pro¬ 
vide  the  information  on  demand. 

Using  software  agents  for  resource  discovery  and  mon¬ 
itoring  has  several  advantages  over  the  more  traditional 
approach  outlined  above.  Monitoring  agents  have  an  au¬ 
tonomous  behavior  and  evolve  based  upon  the  characteris¬ 
tics  of  the  local  system  and  the  requirements  of  the  benefi¬ 
ciary  agent.  Agents  can  engage  in  a  gradual  discovery  pro¬ 
cess  and  respond  to  a  changing  set  of  requirements.  They 
are  able  to  adapt  to  the  architecture  and  the  operating  en¬ 
vironment  of  the  local  node.  An  agent  may  change  its  be¬ 
havior  based  upon  the  results  of  an  inference  process  and 
the tasks assigned to an agent can be rather complex. On
the  other  hand,  the  amount  of  resources  used  by  the  agency 
may  be  larger  than  resources  required  by  a  custom-made 
monitoring  system. 

In  this  paper  we  introduce  an  agent-based  model  for  re¬ 
source  discovery.  Agents  running  at  individual  nodes  learn 
about  the  existence  of  each  other  using  a  mechanism  called 
distributed awareness. Each agent maintains information
about the other agents it has communicated with over a
period of time, and agents periodically exchange this
information among themselves. Whenever an agent needs detailed
information about individual components of the system, we
use the information gathered by the distributed awareness
mechanism and then dynamically assemble agents capable
of reporting the state of remote resources and of negotiating
the use of these resources. The remote agent creation and
surgery techniques are general and allow us to alter
drastically the behavior of an agent.

We present two models for distributed awareness, a
deterministic model that supports a qualitative analysis and a
more intricate, quantitative model. We introduce the Bond
system and discuss the assembly and surgery of a monitoring
agent capable of reporting the use of primary and secondary
storage.

The Bond system is available under an open source
license from http://bond.cs.purdue.edu.

5.  Acknowledgments 

The  work  reported  in  this  paper  was  partially  supported 
by a grant from the National Science Foundation, MCB-9527131,
by the Scalable I/O Initiative, and by a grant from
the  Intel  Corporation. 

6. Appendix: a JPython Strategy to Gather
Memory Information

add state MemoryCheck with strategy language
    python embedded
{:

from java.lang import Runtime, StringBuffer
from java.io import InputStream, StringWriter

import string

def getcmdresults(cmd):
    """Run a command and return its output
    as a string also return the exit
    value as the tuple's second arg
    Runtime.exec() writes some nonsense
    on standard output at least in Linux
    """
    p = Runtime.getRuntime().exec_(cmd)
    p.waitFor()
    output = p.getInputStream()

    buf = StringWriter()
    c = output.read()
    while c != -1:
        buf.write(c)
        c = output.read()
    return (buf.getBuffer().toString(), p.exitValue())

def accustat(param):
    """Accumulate information about users
    and return a hashtable requires System V
    ps (Solaris 2.x, newest Linux) see man page
    for parameter names, try e.g. pmem
    """
    [list, exitcode] = getcmdresults('ps -eo user,' + param)
    if exitcode > 0:
        return None
    broken = string.split(list, '\n')
    map = {}
    for line in broken[1:]:
        spl = string.split(line)
        if len(spl) != 2:
            continue
        [user, param] = spl
        if map.has_key(user):
            map[user] = map[user] + string.atof(param)
        else:
            map[user] = string.atof(param)
    return map

def vmstat():
    """Return the statistics from the vmstat
    output in form of a hashtable. See manual
    page for the meanings of the keys (system
    dependent although some are common).
    """
    [list, exitcode] = getcmdresults('vmstat 1 2')
    if exitcode > 0:
        return None
    broken = string.split(list, '\n')
    names = string.split(broken[1])
    values = string.split(broken[3])
    map = {}
    i = 0
    for name in names:
        map[name] = string.atoi(values[i])
        i = i + 1
    return map

def save(map, prefix = ''):
    """save a hashtable into the model with
    optional prefix (should include the dot)
    """
    for name in map.keys():
        model.set(prefix + name, map[name])

save(vmstat(), 'discover.')
self.fsm.transition("gotoReport")
:}
ABSTRACT

Management of large-scale parallel and distributed
applications is an extremely complex task due to
factors such as centralized management architectures,
lack of coordination and compatibility among
heterogeneous network management systems, and
dynamic characteristics of networks and application
bandwidth requirements. The development of an
integrated network management framework that is
proactive, scalable and robust is a challenging
research problem. In this paper, we present our
approach to implementing a Proactive Application
Management System (PAMS). The PAMS architecture
consists of two main modules: Application Centric
Management (ACM) and Management Computing
System (MCS). The ACM module provides the
application developers with all the tools required to
specify the appropriate management schemes to
manage any quality of service requirement or
application attribute/functionality (e.g., performance,
fault, security, etc.). The MCS provides the core
management services to enable the efficient proactive
management of a wide range of network applications.
The services offered by the MCS are implemented
using mobile agents. Furthermore, each MCS service
can be implemented using several techniques that can
be selected dynamically by invoking the corresponding
mobile agent template for the service implementation.
We also present our preliminary results of
evaluating PAMS management services to manage the
performance and fault tolerance of three
applications of different sizes (small, medium and
large). The experimental results demonstrate that our
agent-based approach can lead to significant
performance gains and low-overhead fault management
of parallel/distributed applications. For example, the
overhead incurred in the application fault management to
tolerate one task failure, two task failures, and three
task failures in a medium to large size application is
less than 0.02%.
1. Introduction

The emerging high speed networks and the
advances in computing technology are important
driving forces for merging communications and
computing technologies, which will result in an explosive
growth in network complexity, size and networked
applications. Furthermore, we are observing an
explosive growth in network applications that use
computing, networking and storage resources that can
be accessed from global, national and/or international
networks. The management of such networks and their
distributed applications has become increasingly
complex and unmanageable. Unfortunately, the
current network management technologies focus on
collecting management information and manually
managing the network using platform-specific products.
There has been little research toward the development
of intelligent, efficient, proactive end-to-end
management of large networks and their applications.

The  increased  importance  of  network  management 
for  large-scale  networks  has  stimulated  research  on 
novel  approaches  to  reduce  the  management 
complexity  and  cope  with  dynamic  management 
change. Instead of a centralized manager, multiple
managers and their communication protocols have been
proposed, such as Management by Delegation
(MbD) [4] and Code Mobility [5]. Another approach,
which replaces the manager-agent relationship among
managers and agents with a peer-to-peer relationship
using the Common Object Request Broker
Architecture (CORBA), has been studied within the
Telecommunications Information Networking
Architecture (TINA) framework [2]. A few web-based
approaches to network management have emerged
recently (JMAPI, WBEM) [3].

However, distributed network management of
applications over heterogeneous networks has not been
fully studied and is becoming increasingly important.
Recently, Application Management MIB [7] and MIB for
Application [6] have been proposed in the IETF to
collect and store common application management
information. The Common Information Model (CIM) by the
DMTF proposes a similar process information definition
for WBEM [Patrck98]. Still, little work has been
done to achieve programmable application
management schemes, and the problem is not well understood.


Figure 1. The Runtime Architecture of the Proactive
Application Management System. (Legend: ADM: Application
Delegated Manager; TA1..n: Task Agent 1..n; TIB: Task
Information Base; S: Sensor; A: Actuator.)

In  this  paper,  we  present  the  design  and  evaluation  of  a 
Proactive  Application  Management  System  (PAMS) 
prototype  being  developed  at  the  University  of 
Arizona. PAMS provides adaptive application
management services to dynamically manage the
performance and fault tolerance of parallel/distributed
applications in an unreliable and heterogeneous
computing environment. The PAMS implementation is
based on mobile agents that can be programmed
to maintain the quality of service requirements of
distributed applications. We have evaluated three
adaptive  techniques  to  manage  the  performance  and 
fault  tolerance  of  distributed  applications.  The  first 
approach  is  based  on  using  active  redundancy  to 
improve  performance  and  tolerate  faults.  The  second 
approach is based on passive redundancy, in which a
set of machines is designated as backup machines to
be  used  to  replace  any  of  the  machines  assigned  to  the 
application  tasks  in  order  to  improve  performance  or 
to  tolerate  software/hardware  failures.  The  third 
approach  does  not  introduce  redundancy  in  the  system 
and  it  requires  task  migration  to  another  machine  in 
order  to  improve  performance  or  to  tolerate 
software/hardware failures. The preliminary results of
applying these techniques demonstrate that our agent-based
approach can lead to significant performance gains
and low-overhead fault management of
parallel/distributed applications. The organization of
the  paper  is  as  follows.  In  Section  2,  we  give  a  brief 
overview  of  the  PAMS  prototype.  In  Section  3,  we 
discuss  our  approach  to  benchmark  and  evaluate  the 
adaptive  performance  management  services  offered  by 
PAMS.  In  Section  4,  we  benchmark  and  evaluate  the 
adaptive  fault  management  service. 

2.  Architecture  of  the  Proactive 
Application  Management  System  (PAMS) 

The  architecture  of  PAMS  is  shown  in  Figure  1 . 
The  ACM  layer  provides  application  developers  with 
the  tools  required  to  specify  and  characterize  the 
application  requirements  in  terms  of  performance, 
fault,  security,  and  also  specify  the  appropriate 
management  scheme  to  maintain  the  application 
requirements.  Once  the  application  management 
requirements  are  defined  using  the  ACM  tools,  the 
next  step  is  to  utilize  the  management  services 
provided  by  the  Management  Computing  System 
(MCS)  to  build  the  appropriate  application  execution 
environment  that  can  dynamically  control  the  allocated 
resources  to  maintain  the  application  requirements 
during  the  application  execution.  The  MCS  assigns 
one  Application  Delegated  Manager  (ADM)  to 
manage  one  or  more  application  attributes 
(performance,  fault,  security,  etc.).  For  each  task  in  the 
application,  the  ADM  launches  an  appropriate  Task 
Agent  (TA)  to  monitor  and  manage  the  task  execution. 
The TA monitors the task execution using appropriate
task sensors and, whenever the task execution on the
assigned machine cannot meet its requirements, intervenes
using the task actuators, which can suspend the task,
save the task execution state, or migrate the task execution
to another remote machine. Our approach supports
several strategies to maintain each task attribute. For
example, to manage the task performance, the ADM could
use active redundancy, passive redundancy, or
migration of the task execution to a faster machine when
the assigned machine becomes heavily loaded. The
appropriate  management  scheme  can  be  selected  at 
runtime  depending  on  the  system  state  and  the  current 
available  resources  as  will  be  discussed  in  further 
detail  later. 

The main management activities of the TA can be
abstracted into three procedures or functions:
Change_Detection, Analysis_Verification, and
Adaptation_Plan. The Change_Detection procedure is
responsible for detecting the conditions in which the
monitored task deviates from the acceptable behavior
or operation (e.g., the task performance degrades
severely due to bursty traffic conditions, or due to
software or hardware failures). The
Analysis_Verification algorithm is invoked whenever
a change is detected, to make sure that the change
is real and not due to a false alarm. Once the change
event is verified and its type is identified, the
Adaptation_Plan procedure is invoked to execute the
appropriate adaptation scheme.
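The three procedures can be sketched as one step of a task agent's control loop. The sketch below is an assumption-laden illustration, not the PAMS implementation: the threshold test, the three-sample verification window and the action names are all hypothetical choices made here to show how detection, verification and adaptation fit together.

```python
def change_detection(samples, threshold):
    """Flag a change when the latest sample exceeds the threshold."""
    return samples[-1] > threshold


def analysis_verification(samples, threshold, window=3):
    """Accept the change only if the last `window` samples all exceed it.

    This filters out a single spike, i.e. a false alarm.
    """
    recent = samples[-window:]
    return len(recent) == window and all(s > threshold for s in recent)


def adaptation_plan(scheme):
    """Map the selected scheme to an action name (hypothetical names)."""
    actions = {
        "active": "use_replica_result",
        "passive": "switch_to_backup",
        "migration": "migrate_task",
    }
    return actions[scheme]


def task_agent_step(samples, threshold, scheme):
    """One monitoring step: detect, verify, then adapt."""
    if not change_detection(samples, threshold):
        return "no_change"
    if not analysis_verification(samples, threshold):
        return "false_alarm"
    return adaptation_plan(scheme)
```

For instance, with CPU load samples in percent and a threshold of 90, a single reading of 99 after two low readings is reported as a false alarm, while three consecutive high readings trigger the selected adaptation scheme.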


Proactive_Application_Management Algorithm

1   For each Ap_i, Ap_i ∈ ACM(Ap_i),
2     Assign Application Delegated Manager ADM(Ap_i)
3     Launch ADM(Ap_i)
4     While (AEE(Ap_i) is running) do
5       For each Service S_i ∈ Ap_i,
6         S_i ∈ {S_ft, S_perf, S_security, S_config}
7         Start Service S_i(Ap_i),
8         Monitor S_i(Ap_i)
9       EndFor
10    EndWhile

End Proactive_Application_Management_Algorithm
Figure 2 Proactive Application Management Algorithm


Figure 2 shows the general Proactive Application
Management Algorithm for the PAMS prototype. The
Application Execution Environment (AEE(Ap_i)) refers
to all the resources allocated to run a given application
Ap_i. While the application is running (step 4 in the
algorithm of Figure 2), the ADM starts all the task
agents required to manage the application requirements
(performance, security, fault, etc.) (steps 7 and 8 in the
algorithm of Figure 2) and then monitors the execution
of that application to detect any changes or deterioration
while it is running. In what follows, we discuss the PAMS
approach of using mobile agents to manage the
performance and fault tolerance of parallel/distributed
applications.
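The loop structure of the algorithm in Figure 2 can be sketched in a few lines. This is a hedged illustration: the function name, the callback parameters (is_running, start_service, monitor_service) and the service labels are assumptions introduced here, and the sketch starts each service only once and then keeps monitoring it, which is one possible reading of steps 7 and 8.

```python
SERVICES = ("fault", "performance", "security", "config")


def proactive_application_management(applications, is_running,
                                     start_service, monitor_service):
    """Follow the structure of the algorithm in Figure 2; return an event log."""
    log = []
    for app in applications:
        # Steps 2-3: assign and launch the delegated manager for this application.
        log.append(("launch_adm", app["name"]))
        started = set()
        # Step 4: loop while the application execution environment is running.
        while is_running(app):
            for service in app["services"]:
                if service not in SERVICES:
                    continue
                if service not in started:
                    start_service(app, service)       # step 7
                    started.add(service)
                    log.append(("start", app["name"], service))
                monitor_service(app, service)         # step 8
                log.append(("monitor", app["name"], service))
    return log


def make_countdown(n):
    """Hypothetical helper: report 'running' for exactly n checks."""
    remaining = {"ticks": n}

    def is_running(app):
        remaining["ticks"] -= 1
        return remaining["ticks"] >= 0

    return is_running


events = proactive_application_management(
    [{"name": "app1", "services": ["performance", "fault"]}],
    make_countdown(2),
    start_service=lambda app, service: None,
    monitor_service=lambda app, service: None,
)
```

In a real system the callbacks would launch and query task agents; here they are no-ops so that only the control flow is visible.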

3.  Adaptive  Performance  Application 
Management 


Figure 3 Controlling Techniques of Performance
Management: (a) Active Redundancy, (b) Passive
Redundancy, (c) Migration

Performance  management  for  distributed  systems 
is  complex  due  to  the  existence  of  many  components 
that  need  to  be  monitored  and  controlled.  Performance 
management  techniques  can  be  broadly  characterized 
into  two  schemes:  monitoring  and  controlling. 
Monitoring  is  the  function  that  tracks  the  performance 
activities  of  the  resources,  networks  and  their 
applications.  The  controlling  function  enables 
performance  management  to  make  adjustments  to 
improve performance. We need algorithms and
techniques  to  derive  appropriate  performance  metrics 
[9]  [10],  and  resource  indicators  for  different  levels  of 
performance.  Adjusting  threshold  schemes  [13]  and 
polling  intervals  [14]  are  the  main  issues  in 
implementing  the  performance  monitoring  function. 
Performance  statistics  can  be  used  to  recognize 
potential  bottlenecks  or  failures  before  they  cause 
problems.  Five  major  prediction  models  for 
performance  predictions  for  parallel  or  distributed 
applications  are  discussed  in  [10].  With  performance 
prediction,  performance  management  schemes  can 
proactively  manage  large  and  complex  systems. 
Dynamic  load-balancing  [12]  and  process  migration 
[11]  have  also  been  studied  to  provide  appropriate 
performance  management. 

In  our  application  performance  management,  we 
monitor  the  execution  times  of  an  application  as  well 
as  the  resource  and  network  utilization.  In  addition,  we 
use  redundancy  techniques  and  task  migration  to 
implement  the  control  functions  required  to 
dynamically  manage  the  application  performance.  In 
this  paper,  we  evaluate  three  techniques  to  manage  the 
application  performance:  active  redundancy,  passive 
redundancy  and  migration.  Each  technique  is 
implemented  as  an  agent  template  as  shown  in  Figure 
3. 

The  active  redundancy  scheme  duplicates  the 
execution  of  the  application  on  two  machines  (see 
Figure  3  (a)).  In  this  scheme,  the  task  agent  will  pick 
up  the  results  from  the  first  machine  that  completes  the 
task execution. This approach has several advantages.
First, it leads to better performance because we always
pick up the results from the faster machine. Second, it
simplifies the performance management, since there is no
need to perform task migration or load balancing in the
system due to load changes or bursty traffic
conditions.
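The idea of picking up the result from whichever machine finishes first can be sketched with Python threads standing in for machines. This is an illustration under stated assumptions, not the PAMS agent template: run_with_active_redundancy and simulated_task are hypothetical names, and the sleep delay merely imitates machine load.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_with_active_redundancy(task, machines):
    """Run `task(machine)` on every machine; return the first finished result."""
    pool = ThreadPoolExecutor(max_workers=len(machines))
    futures = {pool.submit(task, m): m for m in machines}
    done, _pending = wait(futures, return_when=FIRST_COMPLETED)
    winner = next(iter(done))
    pool.shutdown(wait=False)  # do not block on the slower replica
    return futures[winner], winner.result()


def simulated_task(machine):
    # Hypothetical workload: the delay stands in for the load on the machine.
    time.sleep(machine["delay"])
    return "result from " + machine["name"]


machine, result = run_with_active_redundancy(
    simulated_task,
    [{"name": "loaded", "delay": 0.5}, {"name": "idle", "delay": 0.05}],
)
```

The price of this simplicity is that both replicas consume resources for the whole run, which is the cost the passive redundancy scheme tries to reduce.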


Figure 4 Application Execution with migration scheme
(legend: load <5%; load <99% and no migration;
load <99% and migration)
The passive redundancy scheme assigns each task to a
primary machine that will run the task and another
machine to be used as a backup whenever the task
performance deteriorates on the assigned machine (see
Figure 3 (b)). The backup machine is kept up to date
in order to be ready to resume the task execution from
the last updated checkpoint. The main advantage of
this approach is that it needs fewer resources than the
active redundancy approach. In this scheme, one
backup machine can serve as a backup for
several tasks.

The  third  approach  does  not  introduce  redundancy 
and  improves  the  performance  by  task  migration  (see 
Figure 3 (c)). However, the overhead of task migration
is high and it should be used only for large task
granularities where the migration overhead is
relatively small when compared to the task execution
time.
Figure 5 Application Execution with Redundancy policies
(legend: load <5%; load <99%, no redundancy;
load <99%, active redundancy; load <99%, passive
redundancy)

We  benchmarked  the  overhead  associated  with 
implementing  PAMS  performance  management 
service  for  two  application  types:  a  small  application 
with  an  average  execution  time  of  30  seconds  and  a 
large  application  with  an  average  execution  time  of 
450 seconds. We evaluated the use of migration, active
redundancy and passive redundancy techniques to
dynamically manage the performance of these two
applications. If, during the application execution, the
load on a machine suddenly increased to 99% CPU
utilization, the migration approach was able to
improve the performance by 25% for the small size
application (approximately 40 seconds) and by 75%
for the large application (approximately 308 seconds)
as shown in Figure 4. The active redundancy
technique achieved a 31% performance gain for the
small application and 174% for the large application as
shown in Figure 5. Similar results were achieved in
the passive redundancy approach, where a 22%
performance gain was achieved for the small
application and a 114% performance gain for the large
application.

4.  Adaptive  Fault  Tolerance 

The main goal of application fault
management is to efficiently recover from
hardware/software failures of the system resources.
Redundancy is an important technique to detect and
recover from component failures in the system. The
redundancy can be in the form of hardware, software,
or time [15]. As system complexity increases,
more sophisticated techniques are needed to manage
those redundancies. In addition, the fault management
scheme must be flexible and adaptive. In SCOP [17], a
design methodology is proposed to introduce support
techniques to reduce the resource cost of fault-tolerant
software, both in space and time, by providing
designers with a flexible redundancy architecture in
which dependability and efficiency can be adjusted
dynamically at run time. In another work [18], the use
of mobile agents to support adaptive fault tolerance is
implemented. In our adaptive application
fault-tolerance approach, we use mobile agents to efficiently
manage the redundancy. We evaluate two redundancy
techniques: passive and active redundancy.


In the active redundancy technique shown in
Figure 6, we assign two identical tasks to two
machines that are managed by two Task Agents (TAs);
one task is designated as the primary task while the
second one is referred to as the secondary task. In this
scenario, the ADM doesn't need to determine the
adaptation plan when a fault occurs. If the fault occurs
in the primary task, the results can be picked up
without any delay from the secondary task that
becomes the new primary task once its task agent
detects the failure in the primary task due to software
or hardware failures. In addition to reducing the time
for fault detection, the active redundancy technique
simplifies the communication between task agents.
Figure 7 shows the overhead incurred by applying this
redundancy scheme to adaptively manage the faults of
three applications with three tasks each. In the small
application case (execution time is around 60s), the
overhead incurred in using our scheme to detect and
recover from one task failure, two task failures, and
three task failures is 0.10%, 0.18%, and 0.22%,
respectively. For medium and large
applications, the overhead in managing one, two or
three task failures is very small (less than 0.02%).

Figure 7 The overhead of Active Redundancy Technique

The second approach is based on using passive
redundancy in managing the application faults (see
Figure 8). In this scenario, we assign the task to two
machines: one is designated as the primary machine
while the second machine is designated as the backup
machine. The backup machine does not run the task as
is done in the active redundancy case, but it is periodically
brought up to date about the task execution so it can
resume the task execution from the last checkpoint
(update) if a fault occurs in the primary task.
Furthermore, the backup machine could be assigned as
a backup machine for more than one task. This
improves the utilization of the system resources.
Figure 9 shows the overhead incurred in applying this
redundancy technique to manage the faults of three
applications. For a small application with three tasks,
the overhead incurred to manage one task failure, two
task failures, and three task failures is 0.18%, 0.26%,
and 0.42%, respectively. For a medium to large size application, the
overhead to manage one, two or three task failures is
very small (less than 0.02%).
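
The checkpoint-and-resume behaviour of passive redundancy can be illustrated with a minimal Python sketch. It is not the PAMS implementation: the names `BackupMachine`, `run_primary`, `checkpoint_every` and `fail_at` are hypothetical, the "task" is a toy running sum, and the fault is simulated by returning early.

```python
from dataclasses import dataclass, field


@dataclass
class BackupMachine:
    """Stores the latest checkpoint of every task it backs up (hypothetical)."""
    checkpoints: dict = field(default_factory=dict)

    def update(self, task_id, next_step, state):
        # Periodic update sent by the primary machine.
        self.checkpoints[task_id] = (next_step, state)

    def resume(self, task_id, total_steps):
        # Restart from the last checkpoint, or from scratch if none exists.
        next_step, state = self.checkpoints.get(task_id, (0, 0))
        for i in range(next_step, total_steps):
            state += i  # stand-in for one unit of real work
        return state


def run_primary(task_id, total_steps, backup, checkpoint_every, fail_at=None):
    """Run the task on the primary; return None if a fault is simulated."""
    state = 0
    for i in range(total_steps):
        if i == fail_at:
            return None  # simulated fault on the primary machine
        state += i
        if (i + 1) % checkpoint_every == 0:
            backup.update(task_id, i + 1, state)
    return state


backup = BackupMachine()
result = run_primary("t1", 10, backup, checkpoint_every=3, fail_at=7)
if result is None:
    # The backup redoes only the work done since the last checkpoint.
    result = backup.resume("t1", 10)
```

Because one `BackupMachine` keeps a dictionary of checkpoints, it can back up several tasks at once, which is the resource saving the text attributes to this scheme.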

It is clear from the experimental results that our
approach is very efficient, especially for large
parallel/distributed applications. Furthermore, by using
mobile agents and agent templates, we can
dynamically select the appropriate redundancy
technique at runtime depending on the system load and
number of available resources.


Figure 8 Passive Redundancy Techniques for Fault Management


5.  Conclusion 

In this paper, we presented our approach to implementing
a Proactive Application Management System (PAMS).
The PAMS architecture is based on an integrated
management framework being developed at the
University of Arizona [8]. The experimental results of
using the PAMS management services to manage the
performance and fault-tolerant execution of three
applications of different sizes (small, medium and
large) demonstrate that our agent-based approach can
lead to significant gains in performance and low
overhead in fault management. We are currently
implementing additional services to balance the load
across the network resources and maintain the system
and application security requirements.

Figure 9 The overhead of Passive Redundancy Technique

Abstract

An emerging model for computational grids interconnects
similar multi-resource servers from distributed sites.
A job submitted to the grid can be executed by any of the
servers; however, resource size or balance may be different
across servers. One approach to resource management
for this grid is to layer a global load distribution system on
top of the local job management systems at each site.
Unfortunately, classical load distribution policies fail on two
aspects when applied to a multi-resource server grid. First,
simple load indices may not recognize that a resource
imbalance exists at a server. Second, classical job selection
policies do not actively correct such a resource imbalanced
state. We show through simulation that new policies based
on resource balancing perform consistently better than the
classical load distribution strategies.


1.  Introduction 

An emerging model in high performance supercomputing
is to interconnect similar computing systems from
geographically remote sites, creating a near-homogeneous
computational grid system. Computing systems, or servers,
are homogeneous in that any job submitted to the grid may
be sent to any server for execution. However, the servers
may be heterogeneous with respect to their exact resource
configurations. For example, the first phase of the NASA
Metacenter linked a 42-node IBM SP2 at Langley and a
144-node SP2 at Ames [7]. The two servers were
homogeneous in that they were both IBM SP2s, with identical or
synchronized software configurations. However, they were
heterogeneous on two counts: the number of nodes in each
server, and the fact that the Langley machine consisted of
thin nodes while the Ames machine had wide nodes. A job
could be executed by either server without modifications,
provided a sufficient number of nodes were available on that
server.

*This work was supported by NASA grant NCC2-5268 and contract
NAS2-14303, and by Army High Performance Computing Research Center
(AHPCRC) cooperative agreement DAAH04-95-2-0003 and contract
DAAH04-95-C-0008. Access to computing facilities was provided by
AHPCRC, Minnesota Supercomputer Institute.

The resource manager for the near-homogeneous grid
system is responsible for scheduling submitted jobs to
available resources such that some global objective is satisfied,
subject to the constraints imposed by the local policies at
each site. One approach to resource management for
near-homogeneous computational grids is to provide a global
load distribution system (LDS) layered on top of the local
job management system (JMS) at each site. This architecture
is depicted in Figure 1. The compute server at each
site is managed by a local JMS. Users submit jobs directly
to their local JMS which places the jobs in wait queues
until sufficient resources are available on the local compute
server. The global LDS monitors the load at each site. In
the event that some sites become heavily loaded while other
sites are lightly loaded, the LDS attempts to equalize the
load across all servers by moving jobs among the sites. The
JMS at each site is then responsible for the detailed
allocation and scheduling of local resources to jobs submitted
directly to it, as well as to jobs which are assigned to it by
the global LDS. The local JMS also provides load status
to the LDS to support load distribution decisions, as well
as a scheduling Applications Programming Interface (API)
to implement these decisions. For example, in the NASA
Metacenter, a peer-aware receiver-initiated load balancing
algorithm was used to move work from one IBM SP2 to
the other. When the workload on one SP2 dropped below
a specified threshold, the peer-aware load balancing
mechanism would query the other SP2 to see if it had any work
which could be transferred for execution.

Figure 1. Near-Homogeneous Metacomputing Resource Management Architecture

The architecture depicted in Figure 1 is conceptually
identical to classical load balancing in a parallel or
distributed computer with two notable exceptions. First, the
compute server at each site may be a complex combination
of multiple types of resources (CPUs, memory, disks,
switches, and so on). Similarly, the applications submitted
by the users are described by multiple resource
requirements. We generalize these notions and define a
$K$-resource server and corresponding $K$-requirement job.
Each server $S_i$ has $K$ resources, $S_i^0, S_i^1, \ldots, S_i^{K-1}$. Each
job $J_j$ is described by its requirements for each resource
type, $J_j^0, J_j^1, \ldots, J_j^{K-1}$. Note that the servers are still
considered homogeneous from the jobs' perspective, as any job
may be sent to any server for execution.

The second exception is that the physical configurations
of the $K$ resources for each server may be heterogeneous.
This heterogeneity can be manifested in two ways.
The amount of a given resource at one server site may be
quite different from the amount at another
site. For example, server $S_i$ may have more memory than
server $S_j$. Additionally, servers may have a different
balance of each resource. For example, one server may have
a (relatively) large memory with respect to its number of
CPUs while another server may have a large number of
CPUs with less memory.

Classical load balancing attempts to maximize system
throughput by keeping all processors busy. We extend this
notional goal to fully utilizing all $K$ resources at each site.
One heuristic for achieving this objective is to match the job
mix at each server with the capabilities of that server, in
addition to balancing the load across servers. For example, if
a server has a large shared memory, then the job mix in the
local wait queue should be adjusted by the global LDS to
contain jobs which are generally memory intensive.
Compute intensive jobs should be moved to a server which has
a relatively large number of CPUs with respect to its
available memory. The goal of the LDS is therefore to balance
the total resource demand among all sites, for each type of
resource.

This work investigates the use of load balancing
techniques to solve the global load distribution problem for
computational grids consisting of near-homogeneous
multi-resource servers. The complexity of multi-resource
compute servers, along with the multi-resource requirements of
the jobs, causes the methods developed in past load
balancing research to fail in at least two areas. First, the
definition of the load at a given server is not easily described by
a single load index. Specifically, a resource imbalance, in
which the local job mix does not match the capabilities of
the local server, is not directly detectable. This impacts the
ability of the global LDS to match the workload at a site
to the capabilities of the site. We propose a simple
extension to a classical load index measure based on a resource
balancing heuristic to provide this additional level of
descriptive detail. Second, once a resource imbalance is
detected, existing approaches to selecting which jobs to move
between servers fail to actively correct the problem. We
provide an analogous job selection policy, also based on
resource balancing, which heuristically corrects the resource
imbalance. The combination of these two extensions
provides the framework for a global LDS which consistently
outperforms existing approaches over a wide range of
compute server characteristics.

The remainder of this paper is organized as follows.
Section 2 provides an overview of relevant past research,
concluding with variants of a baseline load balancing algorithm
drawn from the literature. Section 3 investigates the
limitations of the baseline algorithms, and provides extensions
based on the resource balancing heuristic. A description
of our simulation environment is given in Section 4. The
performance results of our new load balancing methods as
compared to the baseline algorithms are also summarized in
Section 4. Finally, Section 5 provides conclusions and a
brief overview of our current work in progress.

2.  Preliminaries 

Research related to this effort is drawn from single server
scheduling in the presence of multiple resource
requirements and general load balancing methods for
homogeneous parallel processing systems.

Recent research in job scheduling for a single server has
demonstrated the benefits of including information about
the memory requirements of a job in addition to its CPU
requirements [13, 14]. The generalized $K$-resource
single server scheduling problem was studied in [10], where
it was shown that simple backfill algorithms based on
multi-dimensional packing heuristics consistently
outperform single-resource algorithms, with increasing $K$. These
efforts all suggest that the local JMS at each site should be
multi-resource aware in making its scheduling decisions.
This induces requirements on the global LDS to provide a
job mix to a local server which maximizes the success rate
of the local server.

The general goal of a workload distribution system is to
have sufficient work available to every computational node
to enable the efficient utilization of that node. A
centralized work queue provides every node equal access to all
available work, and is generally regarded as being efficient
in achieving this goal. Unfortunately, the centralized work
queue is generally not scalable as contention for the
single queue structure increases with the number of nodes. In
massively parallel processing systems where the number of
nodes was expected to reach into the thousands, this was a
key concern. In distributed systems, the latency for
querying the central queue potentially increases as the number of
nodes is increased. Load balancing algorithms attempt to
emulate a central work queue by maintaining a
representative workload across a set of distributed queues, one per
compute node. In this paper, we investigate only the
performance of load balancing across distributed queues.

Classical load balancing algorithms are typically based
on a load index which provides a measure of the workload
at a node relative to some global average, and four policies
which govern the actions taken once a load imbalance is
detected [15]. The load index is used to detect a load
imbalance state. Qualitatively, a load imbalance occurs when
the load index at one node is much higher (or lower) than
the load index on the other nodes. The length of the CPU
queue has been shown to provide a good load index on
time-shared workstations when the performance measure of
interest is the average response time [2, 11]. In the case of
multiple resources (disk, memory, etc.), a linear
combination of the length of all the resource queues provided an
improved measure, as job execution time may be driven by
more than CPU cycles [5].

The four policies that govern the action of a load
balancing algorithm when a load imbalance is detected deal with
information, transfer, location, and selection. The
information policy is responsible for keeping up-to-date load
information about each node in the system. A global information
policy provides access to the load index of every node, at the
cost of additional communication for maintaining accurate
information [1].

The transfer policy deals with the dynamic aspects of a
system. It uses the nodes' load information to decide when
a node becomes eligible to act as a sender (transfer a job
to another node) or as a receiver (retrieve a job from
another node). Transfer policies are typically threshold based.
Thus, if the load at a node increases beyond a threshold $T_s$,
the node becomes an eligible sender. Likewise, if the load
at a node drops below a threshold $T_r$, the node becomes an
eligible receiver. Load balancing algorithms which focus
on the transfer policy are described in [2, 15, 16].

The location policy selects a partner node for a job
transfer transaction. If the node is an eligible sender, the location
policy seeks out a receiver node to receive the job selected
by the selection policy (described below). If the node is
an eligible receiver, the location policy looks for an eligible
sender node. Load balancing approaches which focus on
the use of the location policy are described in [8, 9].

Once a node becomes an eligible sender, a selection
policy is used to pick which of the queued jobs is to be
transferred to the receiver node. The selection policy uses several
criteria to evaluate the queued jobs. Its goal is to select a job
that reduces the local load, incurs as little cost as possible
in the transfer, and has good affinity to the node to which
it is transferred. A common selection policy is
latest-job-arrived, which selects the job which is currently in last place
in the work queue.

The  primary  difference  between  existing  load  balancing 
algorithms  and  our  global  load  distribution  requirements  is 
that  our  node  is  actually  a  multi-resource  server.  With  this 
extension  in  mind,  we  define  the  following  baseline  load 
balancing  algorithm: 

• Load Index. The load index is based on the average
resource requirements of the jobs waiting in the queue
at a given server. This index is termed the resource
average (RA) index. For our multi-resource server
formulation, each resource requirement for a job in the
queue represents a percentage of the server resource
that it requires, normalized to unity. Therefore, the RA
index is a relative index which can be used to compare
the loads on different servers.

•  Information  Policy.  As  the  information  policy  is  not 
the  subject  of  this  study,  we  choose  to  use  a  policy 
which  provides  perfect  information  about  the  state  of 
the  global  system.  We  assume  a  global  information 
policy  with  instantaneous  update. 

• Transfer Policy. The transfer policy is threshold based,
since it has been shown to provide robust performance
across a range of load conditions. A server becomes
a sender when its load index grows above the global
load average by a threshold, $T_s$. Conversely, a server
becomes a receiver when its load index falls below the
global average by a threshold $T_r$.

• Location Policy. The location policy is also not the
subject of this study. Therefore, we use a simple
location policy which heuristically results in fast
convergence to a balanced load state. In the event that
the transfer policy indicates that a server becomes a
sender, the location policy selects the server which
currently has the least load to be the receiver. However,
the selected server must also be an eligible receiver,
meaning that it currently has a load which is $T_r$ below
the global average. Conversely, if the server is a
receiver, the location policy selects the server which
currently has the highest load that is $T_s$ above the global
average. If no eligible partner is found, the load
balancing action is terminated.

• Selection Policy. A latest-job-arrived selection policy
(LSP) is used to select a job from the sending server
to be transferred to the receiving server. This
selection policy generally performs well with respect to
achieving a good average response time, but suffers
from some jobs being moved excessively. Therefore,
each job keeps a job transfer count which records the
number of times it has been moved. When this count
reaches a threshold $T_c$, the job is no longer eligible to
be selected for a transfer. Jobs which are already
executing are excluded from being transferred.

The sender initiated (SI), receiver initiated (RI), and
symmetrically initiated (SY) algorithm variants are
generated using a transfer policy which triggers a load balancing
action on $T_s$, $T_r$, or both, respectively. All baseline variants
use the RA load index and the LSP job selection policy.
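
One sender-initiated action of the baseline algorithm can be sketched in Python as follows. This is an illustrative reading of the five components above, not the authors' code. The assumptions are: the thresholds are relative to the global average (as in the balanced-state condition of Section 3.1), RA is the average over resources of the total queued requirement, the most loaded server is the one examined as a potential sender, and LSP falls back to the next-latest job when the latest one has reached $T_c$. All class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Job:
    name: str
    req: List[float]      # requirement per resource (percent of the server)
    transfers: int = 0    # job transfer count


@dataclass
class Server:
    name: str
    queue: List[Job] = field(default_factory=list)  # last item = latest arrival

    def ra(self) -> float:
        """RA index: average over resources of the total queued requirement."""
        if not self.queue:
            return 0.0
        k = len(self.queue[0].req)
        return sum(sum(j.req) for j in self.queue) / k


def sender_initiated_step(servers, ts, tr, tc) -> Optional[Tuple[str, str, str]]:
    """One SI action; returns (job, sender, receiver) or None if terminated."""
    global_ra = sum(s.ra() for s in servers) / len(servers)
    # Transfer policy: the sender must exceed the global average by Ts.
    sender = max(servers, key=Server.ra)
    if sender.ra() < global_ra * (1 + ts):
        return None
    # Location policy: least loaded server, which must be an eligible receiver.
    receiver = min(servers, key=Server.ra)
    if receiver.ra() > global_ra * (1 - tr):
        return None
    # Selection policy (LSP): latest arrival whose transfer count is below Tc.
    for idx in range(len(sender.queue) - 1, -1, -1):
        if sender.queue[idx].transfers < tc:
            job = sender.queue.pop(idx)
            job.transfers += 1
            receiver.queue.append(job)
            return job.name, sender.name, receiver.name
    return None


servers = [
    Server("A", [Job("j1", [4, 4]), Job("j2", [4, 4])]),
    Server("B", [Job("j3", [1, 1])]),
    Server("C", [Job("j4", [3, 3])]),
]
moved = sender_initiated_step(servers, ts=0.25, tr=0.25, tc=2)
```

In this toy run the global RA is 4, so server A (RA 8) is a sender, server B (RA 1) is an eligible receiver, and the latest job `j2` moves from A to B. Excluding already-executing jobs is omitted because the sketch models wait queues only.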

3. Multi-Resource Aware Load Balancing Policies

In this section, we first discuss the limitations of the
resource average load index, RA, and the latest-job-arrived
selection policy, LSP, of the baseline load balancing
algorithms for the heterogeneous multi-resource servers
problem. We provide an example which illustrates where these
naive strategies can fail to match the workload to the
servers, resulting in local workloads which exhibit a
resource imbalance. We then provide extensions to the load
index and the job selection policy which strive to balance
the resource usage at each server.

3.1. Limitations of RA and LSP

The resource average load index, RA, and the
latest-job-arrived job selection policy, LSP, in the baseline algorithm
fail in the multi-resource server load balancing context. The
following discussion gives an example of these failures and
provides some insight into possible new methods. Our new
methods will be further discussed in Section 3.2.

In past research, the index used to measure the load on
a server with respect to multiple resources consisted of a
linear combination or an average of the resource
requirements for the actively running jobs in a time-shared
system. A corresponding index which may be applied to batch
queued space-shared systems is to use the average of the
total resource requirements of the jobs waiting in the wait
queue. However, this may not always indicate a system state
where there exists a resource imbalance, that is, where the total
job requirements for one resource exceed the requirements
for the other resources. Essentially, a server with a
mismatched work mix will be forced to leave some resources
idle while other resources are fully utilized, resulting in an
inefficient use of the system as a whole.

Figure 2(a) depicts the state of the job ready queues,
$RQ_0$ and $RQ_1$, for a two-server system, $S_0$ and $S_1$.
Assume that each server has three resources, $S_i^0$, $S_i^1$, and $S_i^2$,
and that the configuration for the two servers is identical,
$S_0^0 = S_1^0$, $S_0^1 = S_1^1$, and $S_0^2 = S_1^2$. Each of the two ready
queues currently has two jobs. The job which arrived
latest at each server is on the top of the ready queue for that
server. For example, the latest arriving job, $J_1$, in $RQ_0$ has
the resource requirements $J_1^0 = 2$, $J_1^1 = 3$, and $J_1^2 = 2$.
Note that the resource requirements for a job are given as
a percentage of the total available in the server. The total
workload for each resource, $k$, in a given server, $S_i$, is
denoted as

$$W_i^k = \sum_{J_j \in RQ_i} J_j^k, \quad 0 \le i < S, \; 0 \le k < K.$$

The resource average load index for a given server, $S_i$, is
then given by

$$RA_i = \mathrm{Avg}(W_i^k), \quad 0 \le k < K.$$

In this example, $K = 3$ and $RA_0 = RA_1 = 4$.

The third queue in Figure 2(a), $RQ_{Avg}$, represents the
global average workload for each resource in $RQ_0$ and
$RQ_1$. The global average workload for resource $k$ is then
given by

$$W_{Avg}^k = \mathrm{Avg}(W_i^k), \quad 0 \le i < S.$$

Here, $S = 2$ and $W_{Avg}^0 = W_{Avg}^1 = W_{Avg}^2 = 4$, meaning
that on average, each $RQ_i$ has a total requirement of 4
percent for each resource. The global resource average load
index is simply

$$RA = \mathrm{Avg}(W_{Avg}^k), \quad 0 \le k < K,$$

which in this example is $RA = 4$. Server $S_i$ is defined to be
in a load balanced state as long as $RA \cdot (1 - T_x) < RA_i <
RA \cdot (1 + T_x)$, where $T_x$ is the transfer policy threshold, as
defined in Section 2. Since $RA_0 = RA_1 = RA$, the system
is believed to be in a load balanced state.
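
The definitions of $W_i^k$, $RA_i$, $W_{Avg}^k$ and $RA$ can be checked numerically with a short Python sketch. Figure 2 is not reproduced here, so the queue contents below are partly an assumption: only the latest job in $RQ_0$, with requirements (2, 3, 2), is stated in the text. The other three jobs are hypothetical values chosen so that the reported indices come out as stated. The threshold value passed to `is_balanced` is also only an illustration.

```python
# Queue contents, latest arrival last. Only (2, 3, 2) in RQ_0 comes from the
# text; the remaining jobs are assumed values that reproduce the reported
# indices (RA_0 = RA_1 = 4 and W_Avg^k = 4 for every resource k).
RQ = [
    [(1, 3, 1), (2, 3, 2)],   # RQ_0
    [(3, 1, 3), (2, 1, 2)],   # RQ_1
]


def workload(queue):
    """W_i^k: total requirement for each resource k over one ready queue."""
    return [sum(col) for col in zip(*queue)]


def avg(values):
    return sum(values) / len(values)


def is_balanced(ra_i, ra, tx):
    """Load balanced state: RA * (1 - Tx) < RA_i < RA * (1 + Tx)."""
    return ra * (1 - tx) < ra_i < ra * (1 + tx)


W = [workload(q) for q in RQ]            # per-server workload vectors
RA_i = [avg(w) for w in W]               # per-server resource average
W_avg = [avg(col) for col in zip(*W)]    # global average per resource
RA = avg(W_avg)                          # global resource average
balanced = [is_balanced(ra_i, RA, tx=0.1) for ra_i in RA_i]
```

With these values both servers have $RA_i = 4 = RA$, so the RA index alone reports a balanced system even though the per-resource workloads differ sharply between the two queues.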

Even though the RA index indicates a balanced load, it is
clear from Figure 2(a) that the job mix in $RQ_0$ has a higher
requirement for resource $S_0^1$ than for resources $S_0^0$ and $S_0^2$.
The result is that $S_0$ will probably be unable to fully utilize
resources $S_0^0$ and $S_0^2$ as resource $S_0^1$ becomes the bottleneck.
Conversely, the job mix in $RQ_1$ has a higher requirement
for resources $S_1^0$ and $S_1^2$ than for $S_1^1$, resulting in an
inefficient use of resource $S_1^1$. Therefore, the workload at each
server suffers from a resource imbalance.

In order to detect this problem, we define a second load
index, called resource balance (RB), which measures the
resource imbalance at a given server or globally across the
system. Namely, for server $S_i$, $0 \le i < S$,

$$RB_i = \frac{\mathrm{Max}(W_i^k)}{\mathrm{Avg}(W_i^k)} = \frac{\mathrm{Max}(W_i^k)}{RA_i}, \quad 0 \le k < K.$$

Similarly,

$$RB = \frac{\mathrm{Max}(W_{Avg}^k)}{\mathrm{Avg}(W_{Avg}^k)} = \frac{\mathrm{Max}(W_{Avg}^k)}{RA}.$$

Heuristically, the RB index of a server measures how
balanced the job mix is with respect to its different
resource requirements. If the total resource requirements are
all the same, then the local RB measure is unity, since
$\mathrm{Max}(W_i^k) = \mathrm{Avg}(W_i^k)$. This corresponds to the case
where the workload is matched to the server. The global
RB is a measure of how well the total work in the system
matches the capabilities of all the servers in the system. The
goal of the load balancing algorithm is to move each server
towards this global balanced resource level. In Figure 2(a),
$RB_0 = 6/4$ or 1.5, while $RB_1 = 5/4$ or 1.25. Since
$RB = 4/4$ or 1.0, the two servers recognize the existence
of a resource imbalanced state.
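
A minimal Python sketch of the RB index follows. The per-server workload vectors are assumptions consistent with the values quoted above (maximum 6 and average 4 at $S_0$, maximum 5 and average 4 at $S_1$); the figure itself is not available, so the exact vectors are hypothetical.

```python
def rb_index(w):
    """RB = Max(W^k) / Avg(W^k) for one workload vector."""
    return max(w) / (sum(w) / len(w))


# Assumed per-resource workloads W_0^k and W_1^k (hypothetical values that
# match the maxima and averages quoted in the text).
W0 = [3, 6, 3]
W1 = [5, 2, 5]
# Global average workload per resource.
W_avg = [(a + b) / 2 for a, b in zip(W0, W1)]

RB0 = rb_index(W0)       # 6 / 4
RB1 = rb_index(W1)       # 5 / 4
RB = rb_index(W_avg)     # 4 / 4
```

Both local values exceed the global value of 1.0, which is the condition the text uses to flag a resource imbalance that the RA index cannot see.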

Once a resource imbalance is detected, the load
balancing policies must actively correct the imbalance.
Figure 2(b) shows the result of using the LSP policy to
adjust the resource imbalance. Server $S_0$ sends its latest
job to $S_1$, while $S_1$ sends its latest job to $S_0$. Note that
the resource balance index improves on both servers, with
$RB_0 = 4/3.33$ or 1.2, while $RB_1 = 5/4.66$ or 1.07.
However, the resource balance could have been improved even
further, as shown in Figure 2(c), by transferring the jobs
which best balance the workload at both servers. We refer
to this heuristic policy as the balanced job selection policy
or BSP.
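
The difference between LSP and BSP in this example can be sketched in Python. As before, only the job (2, 3, 2) is given in the text; the other jobs are hypothetical values chosen to reproduce the reported RB figures. The balanced choice is modelled here as an exhaustive search over one-for-one exchanges that minimizes the sum of the two RB indices, which is one plausible reading of "best balance the workload at both servers" rather than the authors' stated procedure.

```python
from itertools import product


def rb(queue):
    """RB index of a non-empty queue of per-resource requirement tuples."""
    w = [sum(col) for col in zip(*queue)]
    return max(w) / (sum(w) / len(w))


def swap(q0, q1, i, j):
    """Exchange job i of q0 with job j of q1; returns the two new queues."""
    new_q0 = q0[:i] + q0[i + 1:] + [q1[j]]
    new_q1 = q1[:j] + q1[j + 1:] + [q0[i]]
    return new_q0, new_q1


# Latest arrival last; values other than (2, 3, 2) are assumptions.
rq0 = [(1, 3, 1), (2, 3, 2)]
rq1 = [(3, 1, 3), (2, 1, 2)]

# LSP: each server sends its latest job.
lsp = swap(rq0, rq1, len(rq0) - 1, len(rq1) - 1)

# BSP (sketch): pick the exchange with the lowest combined RB.
pairs = product(range(len(rq0)), range(len(rq1)))
i, j = min(pairs, key=lambda p: sum(rb(q) for q in swap(rq0, rq1, *p)))
bsp = swap(rq0, rq1, i, j)

lsp_rb = [round(rb(q), 2) for q in lsp]
bsp_rb = [round(rb(q), 2) for q in bsp]
```

Under these assumed queues LSP yields RB values of 1.2 and 1.07, matching the text, while the balanced exchange brings both servers to 1.0.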


3.2.  Resource  Balancing  Algorithms 

In  the  following  discussion,  we  extend  the  baseline  load 
balancing  algorithm  with  the  heuristic  RB  load  index  and 
the  BSP  job  selection  policy.  In  general,  the  goal  of  these 
extensions  is  to  move  the  system  to  a  state  where  the  load  is 
balanced  across  the  servers  and  the  job  mix  at  each  server 
matches  the  resource  capabilities  provided  by  that  server. 
These extensions are described below.

Figure 2. Limitations of RA and LSP: (a) Comparison of RA and RB Load
Index Measures; (c) Result of Balanced Job Selection Policy (BSP)


Sender Initiated, Balanced Selection Policy: SI_BSP.

The baseline sender initiated algorithm, SI, is extended to
SI_BSP by modifying the selection policy as follows. The
fact that the load balancing action was triggered by the
condition that the load index, RA, of a given server was above
the global average implies that it has more work than at least
one other server. Thus, this heavily loaded server needs to
transfer work to another server. The BSP policy selects the
job for transfer (out) which results in the best resource
balance of the local queue. Note that transferring a job may
actually worsen the resource imbalance, but we proceed
nonetheless so that the overall excess workload can be
reduced. Also, the resource balance at the receiving server
may worsen as well. However, the receiving server
currently has a workload shortage, so it may be executing less
efficiently anyway.

Sender Initiated, RB Index, Balanced Selection Policy:
SI_RB_BSP. The SI_RB_BSP algorithm extends the
SI_BSP algorithm by including the RB load index, and
modifying the transfer and selection policies as follows. First,
the transfer policy triggers a load balancing action based on
RA or RB. If the action is based on RA, SI_RB_BSP
proceeds as SI_BSP. However, if the action is based only on
RB, the selection policy is further modified over that used
for SI_BSP. The job which positively improves the resource
balance of the local queue the most is selected for transfer
(out). If no such job is found, no action occurs.
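
The balanced job selection policy described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the balance measure used here (largest relative demand divided by mean relative demand, so that 1.0 is perfectly balanced) is an assumed stand-in chosen to agree with the 6/4 and 5/4 values quoted for Figure 2, not necessarily the exact RB definition.

```python
from typing import List, Optional, Sequence


def resource_balance(demand: Sequence[float], capacity: Sequence[float]) -> float:
    """Assumed RB measure: max relative demand / mean relative demand.

    A value of 1.0 means the queued work matches the server's resource mix.
    """
    rel = [d / c for d, c in zip(demand, capacity)]
    mean = sum(rel) / len(rel)
    return max(rel) / mean if mean > 0 else 1.0


def select_job_bsp(queue: List[Sequence[float]],
                   capacity: Sequence[float],
                   require_improvement: bool = False) -> Optional[int]:
    """Pick the queued job whose removal leaves the best-balanced local queue.

    require_improvement=False models an RA-triggered transfer (always send
    the best candidate); True models an RB-only trigger, where a job is
    sent only if the balance strictly improves.
    """
    total = [sum(job[k] for job in queue) for k in range(len(capacity))]
    current = resource_balance(total, capacity)
    best_idx, best_rb = None, float("inf")
    for i, job in enumerate(queue):
        remaining = [t - r for t, r in zip(total, job)]
        rb = resource_balance(remaining, capacity)
        if rb < best_rb:
            best_idx, best_rb = i, rb
    if require_improvement and best_rb >= current:
        return None  # no job improves the balance, so no action occurs
    return best_idx
```

With a two-resource server of capacity (1, 1) and queued jobs (4, 1) and (2, 1), removing the first job lowers the assumed index from 1.5 to about 1.33, so it is the job selected for transfer.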

Receiver Initiated, Balanced Selection Policy: RI_BSP.
The baseline receiver initiated algorithm, RI, is extended to
RI_BSP in a fashion complementary to SI_BSP.

Receiver Initiated, RB Index, Balanced Selection Policy:
RI_RB_BSP. The RI_RB_BSP algorithm extends
the RI_BSP algorithm in a fashion complementary to
SI_RB_BSP.

Symmetrically Initiated, Balanced Selection Policy:
SY_BSP. The baseline symmetrically initiated algorithm,
SY, is extended to SY_BSP as follows. If the transfer policy
triggers a send action, SY_BSP proceeds as SI_BSP.
Alternatively, if the transfer policy triggers a receive action,
SY_BSP proceeds as RI_BSP.

Symmetrically Initiated, RB Index, Balanced Selection
Policy: SY_RB_BSP. The SY_RB_BSP algorithm extends
the SY_BSP algorithm as follows. If the action is
based on RA, SY_RB_BSP proceeds as SY_BSP. However,
if the action is based only on RB, then SY_RB_BSP performs
both send and receive actions using methods identical
to SI_RB_BSP and RI_RB_BSP. Heuristically, this maintains
the RA index but improves the RB index.

4.  Experimental  Results 

The  baseline  and  extended  load  balancing  algorithms 
were  implemented  on  a  simulated  system  that  is  described 
in  Section  4.1.  The  experimental  results  are  summarized  in 
Section  4.2. 

4.1.  System  Model 

The simulation system follows the general form of Figure 1.
The server model, workload model, and performance
metrics are discussed below.

Server Model. A system with 16 servers was used for the
current set of experiments. A server model is specified by
the amount of each of the K resource types it contains and
the choice of the local scheduler. For all simulations, the
local scheduler uses a backfill algorithm with a resource
balancing job selection criterion [10]. To our knowledge, this
is the best performing local scheduling algorithm for the
multi-resource single server problem. At this point, we
assume that the jobs are rigid, meaning that they must receive
the required resources before they can execute. We also
assume that the execution time of a job is the same on any
server. Simulation results are reported for a value of K = 8.

Two independent parameters were used to specify the
degree of heterogeneity across the servers in the simulated
system. First, within a single server, the server resource
correlation parameter, Src, specifies how the resources of a
given server are balanced. This represents the intra-server
resource heterogeneity measure. For example, assume each
server has two resources, CPUs and memory. If a correlation
value of about one were specified, then a server
with a large memory would also have a large number of
CPUs. Conversely, if a correlation value of about negative
one were used, then a server with a large memory would
probably have a low number of CPUs. Finally, a correlation
value near zero implies that the resource sizes within a
given server are unrelated. The value of the resource
correlation ranged from 0.15 to 0.85 in the simulations (our
simulator is capable of generating Src values in the range
-1.0/(K - 1) < Src < 1.0).
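
The lower limit of -1.0/(K - 1) is what one would expect if every pair of the K resources shares the same correlation value. The sketch below checks that condition; it assumes such an equal-correlation structure (an assumption, since the simulator internals are not described here), and the function names are hypothetical. A K x K matrix with ones on the diagonal and rho elsewhere has eigenvalues 1 + (K - 1)rho (once) and 1 - rho (K - 1 times), and it is a valid, non-degenerate correlation matrix only when both are positive.

```python
from typing import Tuple


def equicorrelation_eigenvalues(k: int, rho: float) -> Tuple[float, float]:
    """Distinct eigenvalues of the K x K equal-correlation matrix.

    Returns (1 + (K-1)*rho, 1 - rho); the second has multiplicity K-1.
    """
    return 1.0 + (k - 1) * rho, 1.0 - rho


def is_valid_src(k: int, rho: float) -> bool:
    """True when rho lies strictly inside -1/(K-1) < rho < 1."""
    first, rest = equicorrelation_eigenvalues(k, rho)
    return first > 0.0 and rest > 0.0
```

For K = 8 this gives a lower limit of about -0.143, so a target of -0.2 is rejected while every value used in the reported simulations (0.15 to 0.85) is accepted.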

The second parameter is the server resource variance,
Srv, which is used to describe the range of sizes for a single
resource which may be found across all of the servers. This
represents the inter-server heterogeneity measure. Again,
a resource variance of about one implies that the number of
CPUs found in server Si will be approximately the same as
the number of CPUs found in server Sj for 0 ≤ i, j < S.
In general, a resource variance of Srv = V implies that
the server Si with the largest amount of a resource k has
V times as much of that resource as the server Sj which
has the smallest amount of that resource. All other servers
have some amount of resource k between the amounts in
Sj and Si. The value of the resource variance ranged from
1.2 to 8.0 for our experiments.
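
As a small illustration of this definition, the ratio for one resource type can be computed directly from the per-server amounts. The function name below is hypothetical and the server sizes in the usage note are made up for the example.

```python
from typing import Sequence


def resource_variance(amounts: Sequence[float]) -> float:
    """Ratio V of the largest to the smallest amount of one resource
    across all servers (amounts must be positive)."""
    return max(amounts) / min(amounts)
```

For instance, three servers holding 2, 4, and 16 units of a resource give a ratio of 8.0, the upper end of the range used in the experiments.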

Workload Model. The two main aspects of the simulated
workload are the generation of multi-resource jobs and the
job arrival rate. Recent studies on workload models have
focused primarily on a single resource: the number of CPUs
that a job requires. Two general results from these studies
show that the distribution of CPU requirements is generally
hyperexponential, but with strong discrete components
at powers of two and squares of integers [3]. An additional
study investigated the distribution of memory requirements
on the 1024 processor CM-5 at Los Alamos National
Laboratory. The conclusion was that memory requirements
are also hyperexponentially distributed with strong discrete
components. Additionally, there was a weak correlation
between the CPU and memory requirements for the job stream
studied [4].

We generalize these results to a K-resource workload as
follows. The multiple resource requirements for a job in
the job stream are described by two parameters. The kth
resource requirement for job j, J_j^k, is drawn from a
hyperexponential distribution with mean X^k. Additionally, the
correlation between resource requirements within a single
job, Jrc, is also specified. A single set of workload
parameters was used for all of the initial simulations reported here,
in which X^k = 0.15, 0 ≤ k < K, and the resource
correlation Jrc = 0.25. Essentially, the average job requires
15% of each resource in an average server, and its relative
resource requirements are near random.
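
A hyperexponential requirement with mean 0.15 can be drawn, for example, as a mixture of two exponentials. The sketch below is an assumption-laden illustration: the branch probability and the two branch means (0.9, 0.1, 0.6) are hypothetical values chosen only so that the mixture mean is 0.9 x 0.1 + 0.1 x 0.6 = 0.15, and the sketch ignores the intra-job correlation Jrc and the discrete components mentioned above.

```python
import random
from typing import List


def sample_requirement(rng: random.Random,
                       p: float = 0.9,
                       mean_small: float = 0.1,
                       mean_large: float = 0.6) -> float:
    """Draw one resource requirement from a two-branch hyperexponential.

    With probability p use an exponential with mean mean_small, otherwise
    one with mean mean_large. All three defaults are illustrative only.
    """
    mean = mean_small if rng.random() < p else mean_large
    return rng.expovariate(1.0 / mean)  # expovariate takes the rate, 1/mean


def sample_job(rng: random.Random, k: int = 8) -> List[float]:
    """Draw K independent requirements (correlation Jrc is not modeled)."""
    return [sample_requirement(rng) for _ in range(k)]
```

Averaging a large number of draws from `sample_requirement` should give a value close to 0.15, matching the statement that the average job requires 15% of each resource in an average server.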

Figure 3(a) shows the single resource probability
distribution used for the workload. Note that the probability for
small resource requirements is reduced over a strictly
exponential distribution. We justify this modification by noting
that small jobs are probably not good candidates for load
balancing activity as they do not impact the local job
scheduler efficiency significantly (except to improve it).
Figure 3(b) shows the joint probability distribution for a dual
resource (K = 2) system. In general, the joint probability
distribution shown in Figure 3(b) is nearly identical for all
pairs (i, j), 0 ≤ i, j < K, of resources in the job stream.

This workload model has also been used to study
multi-resource scheduling on a single server [10].

The job arrival rate generally affects the total load on the
system. A high arrival rate results in a large number of jobs
being queued at each server, while a low arrival rate reduces
the number of queued jobs. For our initial simulations, we
selected an arrival rate that resulted in an average of 32 jobs
per server in the system. As each job arrives, it is sent to a
server selected randomly from a uniform distribution ranging
from 0 to S - 1. A final assumption is that the nature
of the workload model impacts only the absolute values of
the system performance, not the relative performance of the
algorithms under study.

Performance Metrics. A single performance metric,
throughput, is our current method for evaluating these
algorithms. Throughput is measured as the total elapsed time
from when the first job arrives to when the last job departs.

4.2.  Simulation  Results 

Our initial simulation results are depicted in Figures
4(a)-(f). Recall that load balancing algorithms essentially
try to mimic a central work queue from which any server
can select jobs as its resources become available. Therefore,
the performance results for the load balancing algorithms
are normalized against the results of a system with a central
work queue. For each graph in the figure, the x axis
represents the server resource variance parameter, Srv, as
described previously, while the y axis represents the
throughput of the algorithms, normalized to the throughput of the
central queue algorithm. The following paragraphs
summarize these results.
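
The normalization can be illustrated as follows. Because throughput is measured through the elapsed time of a fixed job stream, this sketch assumes that normalized throughput is the ratio of the central queue's elapsed time to the algorithm's elapsed time; that reading, and the function names, are assumptions made for illustration.

```python
from typing import Sequence


def elapsed_time(arrivals: Sequence[float], departures: Sequence[float]) -> float:
    """Time from the first job arrival to the last job departure."""
    return max(departures) - min(arrivals)


def normalized_throughput(algorithm_elapsed: float, central_elapsed: float) -> float:
    """Assumed normalization: values below 1.0 mean the load balancing
    algorithm needed longer than the central queue for the same jobs."""
    return central_elapsed / algorithm_elapsed
```

Under this reading, an algorithm that needs 100 time units for a job stream the central queue finishes in 85 has a normalized throughput of 0.85.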

Impact of the Resource Balancing Policies. Figures
4(a)-(c) depict the performance of the sender initiated,
receiver initiated, and symmetrically initiated baseline and
extended algorithms, normalized to the performance of the
central queue algorithm. For these experiments, K = 8 and
Src = 0.50 (resources within a server are very weakly
correlated). In comparing the performance of the baseline and
the extended algorithms, we see that the extended variants
consistently out-perform the baseline algorithm from which
they were derived. The addition of the intelligent job
selection policy, BSP, provides a 5-10% gain in the SI_BSP,
RI_BSP, and SY_BSP algorithms over the SI, RI, and SY
algorithms, respectively. Moreover, the addition of the RB
load index and associated transfer policy further increases
these gains for SI_RB_BSP, RI_RB_BSP, and SY_RB_BSP.

Effects of Server Resource Correlation, Src. The jobs
which arrive at each server may or may not have a natural
affinity for that server. For example, if a server has a
large memory and a few CPUs, a job which is memory
intensive has a high affinity for that server. However, a job
which is CPU intensive has a low affinity for that server.
For a job stream with a fixed intra-job resource correlation,
Jrc, the probability that an arriving job has good affinity
to a server increases as Src increases. A larger natural
affinity increases the packing efficiency of the local
schedulers, improving the throughput. Figures 4(d)-(f) compare
the performance of the RI_RB_BSP, SI_RB_BSP, and the
SY_RB_BSP algorithms, over the range of server resource
correlation values, Src = {0.15, 0.50, 0.70}. Generally, as
the value of Src increases, the performance of the load
balancing algorithms also improves, due to an increased
probability of natural affinity.

Figure 3. Multi-Resource Workload Model

The SI_RB_BSP algorithm performs slightly better than
RI_RB_BSP at low values of Src as seen in Figure 4(d).
However, RI_RB_BSP begins to outperform SI_RB_BSP at
higher values of Src, as seen in Figures 4(e) and 4(f). At low
values of Src, the SI variant can actively transfer out jobs
with low affinity, which occur with high probability, while
the RI variant can only try to correct the affinity of its
total workload. Higher values of Srv magnify this problem.
Therefore, the performance advantage goes to the SI
variant. For higher values of Src, the probability of good
job-server affinity is also higher. When accompanied by higher
Srv, the system may be thought of as having some larger
servers and some smaller servers, with good job affinity to
any server. In this case, the throughput of the system is
driven by the efficiency of the larger servers. In the SI
variant, the smaller servers will tend to initiate load balancing
actions, by sending work to the larger servers. So while the
smaller servers may execute efficiently, the larger servers
may not. However, in the RI variant, the larger servers will
tend to initiate load balancing, and intelligently select which
work to receive from the smaller servers. Now, the larger
servers will tend to execute more efficiently. For this
reason, the performance advantage goes to the RI variant.

Impact of Server Resource Variation, Srv. As the
resource variation, Srv, increases in the graphs of Figure 4,
the throughput of the load balancing algorithms drops
relative to the central queue algorithm. This is due to the fact
that the average job size (size of the job's resource
requirements) is not taken into account when selecting jobs for
transfer. At higher server resource variances, some servers
have a very small amount of one or more resources. However,
the average job size ending up on the servers with
small resource capacities is the same as that ending up
on the larger servers. The small size of the resources in
these servers, relative to the average resource requirement
of the arriving jobs, causes packing inefficiencies by the
local scheduler, due to job size granularity. In the case of a
centralized queue, the servers with small resource capacities
are more likely to find jobs with smaller resource
requirements. In short, simply balancing the workload resource
characteristics is not sufficient. Other workload
characteristics must also be emulated in the local queues, such as the
average job requirements relative to the server resource
capacities. This is a topic in our current work in progress and
is briefly discussed in Section 5.


Central Queue vs. Load Balancing. A final observation
may be drawn from the graphs in Figure 4. Even when
the servers are all similarly configured (e.g., Srv ≈ 1 and
Src ≈ 1), there is a consistent performance gap of 15% for
all baseline and extended load balancing algorithms with
respect to the central queue algorithm. This is due to the fact
that even if the load balancing algorithms are successful in
balancing the load, the local scheduler at each server may
not be able to find a job in its local queue to fill idle
resources, even if such a job exists in the queue of a different
server. Closing this gap is the subject of our current work
and is briefly discussed in Section 5.

Figure 4. Baseline and Extended Algorithm Performance Comparison




5.  Summary  and  Work  in  Progress 

In this paper, we defined a workload distribution problem
for a computational grid with near-homogeneous
multi-resource servers. First, servers in the grid have multiple
resource capacities, and jobs submitted to the grid have
requirements for each of those resources. Additionally, the
servers are homogeneous in that any job submitted to the
grid can be executed by any of the servers, but heterogeneous
in their various resource configurations. We next
investigated a load balancing approach to workload
distribution for this grid. We showed how previous baseline
load balancing policies for single resource systems failed
to maintain a workload at each server which had a good
affinity towards that server. We then proposed two orthogonal
extensions based on the concept of resource balancing.
The basic idea of resource balancing is that the local
scheduler is more effective in utilizing the resources of the
local server, if the total relative resource requirements of
all jobs in a local work queue match the relative capacities
of the server. Our simulation results show that our policy
extensions provided a consistent 5-15% increase in system
throughput performance over the baseline load balancing
algorithms.

However, there is still significant room for improvement
in the workload distribution approach. First, as the
resource variance between servers grows, additional workload
characteristics, beyond the total resource balance, must
be taken into account when evaluating the workload for a
given server. Specifically, we noted that the granularity
of jobs in a local queue impacts the performance of the
smaller servers. Intuitively, small jobs should be sent to
small servers, and large jobs should be sent to large servers.
Here, a large job is one that generally has large resource
requirements, and a large server is one that generally has
large resource capacities. Note that the size of a job is
relative to the size of the server to which it is being compared.
Our current work in progress is investigating refinements to
the load balancing policies which improve the affinity of the
local workload to the local server. Note that these
investigations apply to single resource servers as well.

Second, there is a persistent performance gap between
a central queue approach to workload distribution and our
load balancing algorithms. Our conjecture is that even if the
load is perfectly balanced, restricting a server, Si, to execute
jobs only from its local queue will increase the percentage
of time that some of Si's resources remain idle, when there
may be a job in the queue of a different server, Sj, which
would fit into the idle resources of server Si. These effects
were noted in our simulations in that even when the servers
were all nearly identical, and an equal load was being
delivered to each server, the system throughput was still
significantly below the performance of the central queue
algorithm. Load balancing schemes were limited to about 85%
of the throughput of the central queue scheme at all tested
values of Srv and Src, as seen in Figures 4(a)-(f).

We are further motivated to look at a more centralized
approach by real-world computational grids, such as
NASA’s Information Power Grid (IPG) [6]. The current
implementation of the IPG uses services from the Globus
toolkit to submit jobs, query their status, and query the state
of the grid resources. Globus uses a centralized directory
structure, the Metacomputing Directory Service (MDS), to
store information about the status of the metacomputing
environment and all jobs submitted to the grid. Information
in the MDS is used to assist in the placement of new
jobs onto servers with appropriate resources within the grid.
While this approach is currently being used in the IPG,
there are questions about the scalability of such a centralized
structure. For example, can a central structure handle
hundreds of sites and thousands of jobs? How about
fault tolerance? Our current work in progress is therefore
investigating compromises between a single central queue
and completely distributed queues. The general concept is
to keep work close to the servers where it will most likely
execute, and move work to a specific server when needed.
Recent research in dynamic matching and scheduling for
heterogeneous computing systems uses similar approaches,
along with heuristics for matching a job to idle server
resources [12]. Our work in progress attempts to combine the
centralized nature of current mapping approaches with our
resource-balanced workload affinity approach.

Abstract

Providing up-to-date input to users’ applications is an
important data management problem for a heterogeneous
distributed computing environment, where each data storage
location and intermediate node may have different data
available, storage limitations, and communication links
available. Sites in the heterogeneous network request data
items and each request has an associated deadline and
priority. In a military situation, the data staging problem
involves positioning data for facilitating a faster access time
when it is needed by programs that will aid in decision making.
This work concentrates on solving a basic version of
the data staging problem in which all parameter values for
the communication system and the data request information
represent the best known information collected so far and
stay fixed throughout the scheduling process. The
heterogeneous network is assumed to be oversubscribed and not
all requests for data items can be satisfied. Three
multiple-source shortest-path algorithm based procedures for finding
a near-optimal schedule of the communication steps for
staging the data are described. Each procedure can be used
with each of three cost criteria developed here (based on
results from earlier experiments). A subset of the possible
procedure/cost criterion combinations is evaluated in
simulation studies considering different priority weighting
schemes, different average number of links used to satisfy
each data request, and different network loadings. The
proposed heuristics are shown to perform well with respect to
upper and lower bounds.


1.  Introduction 

The DARPA Battlefield Awareness and Data Dissemination
(BADD) [15] and the Agile Information Control
Environment (AICE) [2] programs include designing an
information system for forwarding (staging) data to proxy
servers prior to their usage as inputs to a local application
in a heterogeneous distributed computing environment,
using satellite and other communication links. The focus is
on providing the ability to operate in a distributed
server-server-client environment to optimize information currency
for many critical classes of information.

Data staging is an important data management problem
that needs to be addressed by the BADD and AICE
programs. A simplified informal description of an example of a
data staging problem in a military application is as follows.
A warfighter is in a remote location with a portable
computer and needs data as input for a program that plans troop
movements. The data can include detailed terrain maps,
enemy locations, troop movements, and current weather
predictions. The data will be available from Washington,
D.C., foreign military bases, and other data storage
locations. Each location may have specific data available,
storage limitations, and communication links. Also, each data
request is associated with a specific deadline and priority.
Each priority level then has a corresponding weight, so that
two levels can be compared analytically. Depending on the
particular environment, there may be hundreds of
warfighters, all making multiple requests. It is assumed that not
all requests can be satisfied by their deadline. In a military
situation, the data staging problem involves positioning data
for facilitating a faster access time when it is needed by
programs that will aid in decision making.

Positioning the data before it is needed can be complicated
by: the dynamic nature of data requests and network
congestion; the limited storage space at certain sites; the
limited bandwidth of links; the changing availability of links
and data; the time constraints of the needed data; the
priority of the needed data; and the determination of where
to stage the data [16]. Also, the associated garbage
collection problem (i.e., determining which data will be deleted
or reverse deployed to rear-sites from the forward-deployed
units) arises when existing storage limitations become
critical [15, 16]. Multiple copies of a data item provide an
increased level of fault tolerance, in cases of links or storage
locations going off-line, and allow the scheduler to select
from among different sources to satisfy a data request [18].

The simplified data staging problem addressed here
requires a schedule for transmitting data between pairs of
nodes in the corresponding communication system for
satisfying as large a sum of weighted priorities as possible.
Each node in the system can be: (a) a source machine of
initial data items; (b) an intermediate machine for storing
data temporarily; and/or (c) a final destination machine that
requests a specific data item.
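
The objective of satisfying as large a sum of weighted priorities as possible can be written down compactly. The following sketch is illustrative only: the `Request` fields, the `arrival_time` mapping, and the example weights are hypothetical names and values, not the notation of the mathematical model reviewed in Section 3.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Request:
    item: str          # requested data item
    destination: str   # machine that requested it
    priority: int      # priority level of the request
    deadline: float    # latest acceptable arrival time


def weighted_priority_sum(requests: Iterable[Request],
                          arrival_time: Dict[Tuple[str, str], float],
                          weights: Dict[int, float]) -> float:
    """Sum the weights of all requests delivered by their deadline.

    arrival_time maps (item, destination) to the scheduled arrival time;
    a missing key means the schedule never delivers that request.
    """
    total = 0.0
    for r in requests:
        t = arrival_time.get((r.item, r.destination))
        if t is not None and t <= r.deadline:
            total += weights[r.priority]
    return total
```

A schedule that delivers one high-priority request on time but misses a low-priority deadline scores only the weight of the first, which is why the choice of weighting scheme shapes which requests a heuristic favors.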

It is also assumed in this simplified model of the data
staging problem that all parameter values for the
communication system and the data request information (e.g.,
network configuration and requesting machines) represent
the best known information collected so far and stay fixed
throughout the scheduling process. It is assumed that not
all of the requests can be satisfied by their deadlines due to
storage capacity and communication constraints. The model
is designed to create a schedule for movement of data
from the source of the data to a “staged” location for the
data. It is assumed that a user’s application can easily retrieve
the data from this location.

Three multiple-source shortest-path algorithm based
procedures for finding a near-optimal schedule of the
communication steps for staging the data are described [20].
Each procedure can be used with each of seven cost criteria
developed. A subset of fourteen of the possible 21 resulting
heuristics that are expected to perform well (based on
experiments in [20]) are examined in simulation studies
considering different priority weighting schemes, different average
number of links used to satisfy each data request, and
different network loadings. The rationale for considering each
of these procedures and costs is provided. The proposed
heuristics are shown to perform well with respect to upper
and lower bounds. Furthermore, the heuristics using a
complex cost criterion are shown to allow more highest priority
messages to be received than a simple-cost-based heuristic
that schedules all highest priority messages first. Finally,
an approach considering data items with “more desirable”
and “less desirable” available versions is evaluated using a
variable time, variable accuracy algorithm, and simulation
results are presented. This research serves as a necessary
step toward solving the more realistic and complicated
version of the data staging problem involving fault tolerance,
dynamic changes to the network configuration, ad hoc data
requests, sensor-triggered data transfers, etc.

The material in this paper extends the earlier work
presented in [19] by introducing three new cost criteria and two
new bounds. This work also varies additional simulation
parameters, including eight network loadings, three average
numbers of links used to get from a source machine to
a destination machine, and five priority weighting schemes.
This paper also introduces a variable time, variable accuracy
approach for using data items with “more desirable” and
“less desirable” versions.

Section 2 provides an overview of work that is related
to the data staging problem. In Section 3, a mathematical
model for a basic data staging problem is reviewed. Section
4 provides a description of Dijkstra’s algorithm used to find
paths of links for transferring data items within the presented
network model. Section 5 presents seven cost criteria for
use in conjunction with different resource allocation
procedures. Three multiple-source shortest-path algorithm based
procedures for finding a near-optimal schedule of the
communication steps for data staging are described in Section 6.
These heuristics adopt the simplified view of the data staging
problem described by the mathematical model. Three
upper bounds and three lower bounds used to evaluate the
performance of these heuristics are presented in Section 7.
The set of simulation studies given in Section 8 was created
after studying the results of [19]. These new simulation
studies examine the effects of (1) having six priority levels
with five different weighting schemes, (2) varying the
average number of links required for a data item to reach a
destination from its source, and (3) varying the total number
of requests that must be scheduled in a given network. In
Section 9, an approach considering data items with “more
desirable” and “less desirable” available versions is evaluated
using a variable time, variable accuracy algorithm, and
simulation results are presented.

2.  Related  Work 

To  the  best  of  the  authors’  knowledge,  there  is  currently 
no  other  work  presented  in  the  open  literature  that  address¬ 
es  the  data  staging  problem,  designs  a  mathematical  model 
to  quantify  it,  or  presents  a  heuristic  for  solving  it.  Due  to 
space  constraints,  the  reader  is  referred  to  [6]  for  a  more 
thorough discussion of the related work. A problem that is,
at a high level, remotely similar to data staging is the facility
location problem in management science and operations
research (e.g., [13]). Data management problems similar to
data staging for the BADD/AICE program are studied for
other communication systems [1, 3, 5, 11]. Other areas that
are somewhat related include modifying routing schemes
[4], mapping tasks onto a suite of distributed heterogeneous
machines (e.g., [8, 9, 21]), and earliest deadline first [7, 17]
scheduling for real-time systems. Lastly, other research
exploring heuristics for use in the BADD/AICE environment
has been performed [14]. All of this research is related,
but none of it develops a mathematical model like the one
researched here, nor does it examine a network similar to
the BADD-like network being used in this research.

3.  Mathematical  Data  Staging  Model 

3.1.  Model  Definition 

Some  of  this  background  material  is  based  on  [20],  and 
is  included  here  for  completeness.  It  has  been  expanded  to 
include  all  of  the  concepts  needed  for  the  new  experiments 
and  results  presented  here. 

Consider a network topology graph G_nt composed of a
set of vertices that represent the set of machines M in the
network and a set of directed edges that represent the set of
communication links L. There are m machines in M,
identified as {M[0], M[1], ..., M[m - 1]}, and each can be a
source, destination, and intermediate location for data items
in the network. Source machines for data items are the
machines where data items are initially located within the
network; these data items may eventually be transferred by the
network to destination machines, possibly stored at
intermediate machines along the way. Each machine M[i] (where
0 ≤ i < m) also has an associated constant unused storage
capacity during the time interval [t_j, t_{j+1}), Cap[i](t_j).
Note that the times t_j and t_{j+1} may not differ by exactly
one time unit.

Each communication link in this system is represented
as one or more virtual links. A virtual link corresponds to
a period of constant, continuous, available bandwidth from
one machine to one other machine. Bidirectional
communication links are therefore represented as two virtual links,
one for each direction. Nl[i, j] is the number of virtual
links from machine M[i] to M[j] (where i ≠ j and
0 ≤ i, j < m). The kth virtual link from machine M[i]
to M[j] is identified as L[i, j][k] (where 0 ≤ k < Nl[i, j]).
The virtual link L[i, j][k] also has an associated link starting
time Lst[i, j][k], denoting the time when it becomes
available, as well as a link ending time Let[i, j][k], which
specifies the time when the link is no longer available.
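
As an illustration only, the virtual-link notation above could be held in a small Python structure. The class name VirtualLink, its field names, and the helper index_links are assumptions made for this sketch and do not come from the model; in particular, representing the constant available bandwidth as a single number is a simplification.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class VirtualLink:
    src: int          # index i of the sending machine M[i]
    dst: int          # index j of the receiving machine M[j], with j != i
    start: float      # plays the role of Lst[i,j][k]
    end: float        # plays the role of Let[i,j][k]
    bandwidth: float  # assumed constant rate over [start, end]


def index_links(links):
    """Group links so that table[(i, j)][k] plays the role of L[i,j][k]."""
    table = defaultdict(list)
    for link in links:
        if link.src == link.dst:
            raise ValueError("a virtual link must join two different machines")
        table[(link.src, link.dst)].append(link)
    return table


# A bidirectional physical link becomes two virtual links, one per direction.
# A second availability window from machine 0 to machine 1 is a third link.
links = [
    VirtualLink(0, 1, 0.0, 60.0, 1000.0),
    VirtualLink(1, 0, 0.0, 60.0, 1000.0),
    VirtualLink(0, 1, 90.0, 120.0, 500.0),
]
table = index_links(links)
nl_01 = len(table[(0, 1)])  # corresponds to Nl[0,1]
```

Keying the table by the ordered pair (i, j) keeps the two directions of a physical link separate, which matches the treatment of bidirectional links as two virtual links.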

Data items are blocks of information that can be
transmitted from one machine to another. The set of data items
with unique names or identifiers that are available on the
machines in M is called Δ. Names or identifiers assigned
to data items must be different if the contents of the data
items are different in any way, including details such as
differing timestamps on weather maps of the same region. The
number of distinctive data items in Δ is n, and individual
unique data items are identified as {δ[0], δ[1], ..., δ[n - 1]}.
For a data item δ[l] (where 0 ≤ l < n), the size of the
data item is represented as |δ[l]|. The time duration required
to transfer data item δ[l] from machine M[i] to machine
M[j] (where i ≠ j and 0 ≤ i, j < m) via the virtual link
L[i, j][k] (where 0 ≤ k < Nl[i, j]) during the time interval
[Lst[i, j][k], Let[i, j][k]] is D[i, j][k](|δ[l]|). Machine M[i]
may be a source of δ[l], or an intermediate storage location
or destination that already holds a copy of δ[l]. Machine
M[j] may be an intermediate storage location or a
destination.

Let Nδ[l] (where 0 ≤ l < n) represent the number
of source machines holding a copy of δ[l], and
M[Source[l, j]] represent the jth source machine for
data item δ[l] (where 0 ≤ j < Nδ[l] and 0 ≤ Source[l, j] <
m). The starting time δst[l, j] refers to the time data item
δ[l] becomes available at its jth source machine. The
removal time δrt[l, i] (where 0 ≤ i < m) refers to the time
data item δ[l] can be removed from machine M[i], if a copy
of δ[l] is being stored at M[i]. This allows the value of
Cap[i](δrt[l, i]) to be increased by |δ[l]|. Intermediate
machines, for example, could set δrt[l, i] to be some small time
period γ after the last deadline at any machine for data item
δ[l]. This would allow the storage space to be reclaimed at
intermediate machines after the usefulness of the data item
has expired. The scheduling heuristics do not remove a data
item from any of its sources or destinations because this is
considered outside the scope of responsibility of the
scheduler.

Consider  now  a  data  item  such  as  an  image  showing  a 
map  of  a  planned  battle  area.  It  may  be  possible  to  have 
available  a  higher  quality  version  of  the  image  that  shows 
a  higher  level  of  detail,  as  well  as  a  lower  quality  version 
showing  less  detail.  An  application  requesting  this  data  item 
would  prefer  to  receive  the  higher  quality  image,  but  it  may 
be  that  there  are  not  enough  resources  (e.g.,  network  band¬ 
width)  available  to  fulfill  this  data  request.  In  this  event, 
however,  there  may  be  enough  resources  available  to  send 
the  lower  quality  image,  which  would  be  better  than  sending 
nothing  at  all. 

The set Rq (where Rq ⊆ Δ) contains unique data items
requested by destination machines in M. The number of
unique data items in Rq is 2p. The higher quality data items
are identified as {Rq[0], Rq[1], ..., Rq[p - 1]}, and the
lower quality data items are identified as {Rq[p], Rq[p +
1], ..., Rq[2p - 1]}. Here, each requested higher quality
data item Rq[i] (where 0 ≤ i < p) has a corresponding
lower quality data item Rq[i + p] that may be sent in place
of Rq[i] if system resources become limited. Note that for
every i there must exist exactly one j and exactly one k
such that Rq[i] = δ[j] and Rq[i + p] = δ[k]. These data
items δ[j] and δ[k] are assumed for simplicity to be present
at the same source machines, and to have the same
associated starting times and removal times. This model also
assumes for simplicity that |Rq[i + p]| is a fixed fraction of |Rq[i]|.

The number of destination machines that request Rq[i]
(where 0 ≤ i < p) is denoted with Nrq[i]. If 0 ≤
k < Nrq[i], then M[Request[i, k]] refers to the kth
machine that requested Rq[i] (where 0 ≤ Request[i, k] < m).
Each of these machines also implicitly requests Rq[i + p]
in the event that Rq[i] cannot be sent, so that Nrq[i +
p] = Nrq[i], and Request[i + p, k] = Request[i, k]
for all values of k. The finishing time Rft[i, k] (and
equivalent Rft[i + p, k]) refers to a deadline time,
after which data item Rq[i] (and Rq[i + p]) is no longer
useful to machine M[Request[i, k]]. The requesting
machine M[Request[i, k]] also associates the data item Rq[i]
with a numbered priority class Priority[i, k] (equal to
Priority[i + p, k]). The highest, or most important
priority class is P, and the lowest, or least important priority
class is 0, so that 0 ≤ Priority[i, k] ≤ P. In actual
systems, the deadline and priority for a data request would be
set by some combination of the user, application, system
administrator, and commander.

Define a schedule as a series of communication steps,
among the machines of M using the communication links
in L, that transfer some or all of the data items in the
set Rq from their respective source machines to some or
all of their respective destination machines, possibly
being stored at intermediate machines along the way.
Suppose that there are σ possible distinct schedules,
enumerated {S_0, S_1, ..., S_{σ-1}}. The kth (where 0 ≤ k < Nrq[j])
request for a data item Rq[j] (where 0 ≤ j < 2p) is
considered satisfiable with respect to a specific schedule S_h
(where 0 ≤ h < σ) if and only if the data item Rq[j]
is available at machine M[Request[j, k]] at or before the
deadline time Rft[j, k]. The set Srq[S_h] then denotes the
set of two-tuples (j, k) such that the kth request for the data
item Rq[j] is satisfiable with respect to the schedule S_h.

There must be a way to represent the relative importance
of a priority class α (where 0 ≤ α ≤ P) compared to
another priority class β (where 0 ≤ β ≤ P and α ≠ β). The
relative weight of any priority class α is denoted by W[α].
This means that if priority class α is ten times as
important as priority class β, then W[α] = 10 * W[β]. In an
actual system, these weights would be set by the system
administrator and commander, and would be a function of the
current operating situation (e.g., peace or war).

Let Worth[j, k] (where 0 ≤ j < 2p and
0 ≤ k < Nrq[j]) denote a percentage of value to a
user of data item Rq[j] sent to satisfy a request at machine
M[Request[j, k]]. Assume that if Rq[i] for 0 ≤ i < p is sent to
M[Request[i, k]] by its deadline, then Worth[i, k] = 1
(meaning 100% for the preferred data version), and
Worth[i + p, k] = 0 (meaning no additional worth
for the second data version). If Rq[i] is not sent to
M[Request[i, k]] by its deadline, and Rq[i + p] is
sent to M[Request[i + p, k]] by its deadline, then
Worth[i + p, k] = 0.25 (meaning 25% for the lower
quality version), and Worth[i, k] = 0. Now, the effect of the
schedule S_h (where 0 ≤ h < σ) can be defined as

\[ E[S_h] = -\Big( \sum_{(j,k) \in \mathit{Srq}[S_h]} W[\mathit{Priority}[j,k]] * \mathit{Worth}[j,k] \Big) \]

(where 0 ≤ j < 2p and 0 ≤ k < Nrq[j]). The global
optimization criterion, and hence, the objective of all of
the heuristics presented later, is to find the schedule with
the minimum effect, defined as min_{0 ≤ h < σ} E[S_h]. This
performance criterion is related to the one described in
[12]. Another way to view this minimization is to think
of it as trying to find the schedule of data transfers that
produces the maximum sum of satisfied requests’ priority
weights.
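
A small numerical sketch may help make the effect concrete. The function name effect, the dictionary layout, and the example weights below are illustrative assumptions; only the formula itself follows the definition above.

```python
def effect(satisfied, priority, weight, worth):
    """Negated sum of W[Priority[j,k]] * Worth[j,k] over satisfied (j, k)."""
    return -sum(weight[priority[(j, k)]] * worth[(j, k)] for (j, k) in satisfied)


# Assumed example with p = 2 (so item 3 is the lower quality version of item 1)
# and three priority classes with weights 1, 10, and 100.
weight = {0: 1.0, 1: 10.0, 2: 100.0}
priority = {(0, 0): 2, (3, 0): 1}
worth = {(0, 0): 1.0, (3, 0): 0.25}

# Schedule A satisfies both requests; schedule B satisfies only the first.
effect_a = effect([(0, 0), (3, 0)], priority, weight, worth)
effect_b = effect([(0, 0)], priority, weight, worth)
```

Because the sum is negated, the schedule that satisfies more weighted worth has the smaller (more negative) effect, so minimizing the effect and maximizing the weighted sum select the same schedule.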

3.2.  Heuristic  Solution  Approach 

The  heuristic  approach  used  here  to  create  the  schedule 
Sh  with  minimum  effect  E[Sh]  utilizes  Dijkstra’s  shortest 
path  algorithm.  This  algorithm,  presented  in  Section  4,  cal¬ 
culates  arrival  times  for  data  items  and  establishes  paths  of 
virtual links to get data items from source machines to
destination machines. The paths calculated by this algorithm
give the earliest arrival time for a given data item, provided
that there are no other data items competing for resources in
the network. That is, the calculation is based on the expected
system resources available when the algorithm is run, and
ignores any future competition for resources among the
pending data requests.
After  Dijkstra’s  algorithm  has  been  run  for  each  requested 
data  item  (i.e.,  all  data  items  in  Rq),  a  single  data  item  and 
one  or  more  destination  machines  are  selected  through  the 
use  of  a  cost  criterion  presented  in  Section  5.  This  data  item 
choice  reflects  a  combination  of  its  contribution  to  the  effect 
of  the  schedule,  and  the  amount  of  time  between  its  arrival 
at  a  destination  and  its  deadline  at  that  destination.  Network 
resources  and  machine  storage  are  then  allocated  according 
to  one  of  the  procedures  presented  in  Section  6,  updating 
link  availability  times  and  available  machine  storage.  This 
updating  of  network  information  will  cause  the  arrival  times 
and  virtual  link  paths  for  some  other  data  items  to  become 
invalid, so the heuristic process (using a cost and an
allocation procedure) is repeated (beginning with Dijkstra’s
algorithm) using the modified network information. This
continues until there are no more satisfiable data items in
the network, thus producing the communication schedule.
Results from simulation studies using this approach, which
only considers one version of each data item (i.e., considers
only Rq[i] where 0 ≤ i < p, not Rq[j] where p ≤ j < 2p),
are found in Section 8. A modified approach considering
both versions of a data item is contained in Section 9.

4.  Dijkstra’s  Shortest  Path  Algorithm 

The heuristics presented here utilize Dijkstra’s algorithm
[10] for finding the shortest path from one or more source
nodes to all other nodes in a directed graph. The version
used calculates the earliest possible available time for a data
item Rq[i] (where 0 ≤ i < 2p) at each machine in M,
given a subset of machines in M that already holds a copy
of Rq[i].

Define the available time A_T[i, j] (where 0 ≤ i < 2p,
0 ≤ j < m) as the earliest possible time found, by
executing Dijkstra’s algorithm, when data item Rq[i] could
be present and available at machine M[j]. Define also the
value of the predecessor π[i, j] to be the two-tuple (s, k)
(where -1 ≤ s < m, -1 ≤ k < Nl[s, j]) identifying the
machine M[s] as the machine that sends data item Rq[i] to
machine M[j] via virtual link L[s, j][k]. This predecessor
is also determined by the execution of Dijkstra’s algorithm.
If the value of π[i, j] is (-1, -1), this means that no
machine sends data item Rq[i] to machine M[j] via any virtual
link. This may happen if machine M[j] is a source machine
for data item Rq[i], or it may happen if it is not possible for
machine M[j] to receive a copy of data item Rq[i] (possibly
due to the unavailability of network resources). For more
information about the implementation of Dijkstra’s algorithm,
including pseudocode and examples, the reader is referred
to [6].
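
The following Python sketch shows one way an earliest-arrival, multiple-source variant of Dijkstra's algorithm could look for a single data item. It is not the implementation from [6]. It assumes that a transfer takes size divided by a constant link bandwidth, that a transfer must finish no later than the link ending time, and that machine storage capacity can be ignored; the function name and the tuple layout of links are also assumptions of this sketch.

```python
import heapq


def earliest_arrival(m, links, sources, size):
    """links: (sender, receiver, k, start, end, bandwidth) tuples.
    sources: {machine: time at which the item is available there}.
    Returns (avail, pred): earliest available time and predecessor per machine."""
    inf = float("inf")
    avail = [inf] * m
    pred = [(-1, -1)] * m  # (-1, -1): a source, or a machine that cannot be reached
    outgoing = {}
    for s, j, k, start, end, bandwidth in links:
        outgoing.setdefault(s, []).append((j, k, start, end, bandwidth))
    heap = []
    for machine, t in sources.items():
        if t < avail[machine]:
            avail[machine] = t
            heapq.heappush(heap, (t, machine))
    while heap:
        t, s = heapq.heappop(heap)
        if t > avail[s]:
            continue  # stale heap entry; a better time was already settled
        for j, k, start, end, bandwidth in outgoing.get(s, []):
            begin = max(t, start)              # wait for the link window to open
            finish = begin + size / bandwidth  # assumed transfer-time model
            if finish <= end and finish < avail[j]:
                avail[j] = finish
                pred[j] = (s, k)
                heapq.heappush(heap, (finish, j))
    return avail, pred


# Assumed four-machine example: the slow direct link 0 -> 2 loses to the
# two-hop route through machine 1, and machine 3 cannot be reached in time.
example_links = [
    (0, 1, 0, 0.0, 100.0, 10.0),
    (1, 2, 0, 50.0, 100.0, 10.0),
    (0, 2, 0, 0.0, 100.0, 1.0),
    (2, 3, 0, 0.0, 5.0, 10.0),
]
avail, pred = earliest_arrival(4, example_links, {0: 0.0}, size=100.0)
```

Under these assumptions an earlier arrival at a sender can never produce a later arrival at a receiver, which is the property that lets the usual greedy settling order remain valid with time-windowed links.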

5.  Data  Item  Selection  Cost  Criteria 
5.1.  Introduction 

Network  resources  must  be  allocated  to  data  requests  in 
some  order;  this  order  intuitively  should  include  “more  im¬ 
portant”  requests  and  requests  that  are  “close”  to  their  dead¬ 
lines  before  “less  important”  requests  and  requests  that  are 
“not  close”  to  their  deadlines.  Dijkstra’s  algorithm  is  used 
here  for  each  data  item  individually,  as  if  it  were  the  only 
request  in  the  system  remaining  to  be  satisfied.  Thus,  Di¬ 
jkstra’s  algorithm  is  executed  for  each  remaining  data  item 
separately.  Some  quantitative  cost  must  therefore  be  ap¬ 
plied  so  that  an  algorithm  can  evaluate  the  relative  merit  of 
any  given  request  compared  to  any  other  request.  Seven  dif¬ 
ferent  cost  criteria  are  detailed  below;  each  attempts  to  take 
into  consideration  both  the  importance  of  a  data  request,  and 
how  close  the  data  request  is  to  its  deadline. 

Suppose M[r] (where 0 ≤ r < m) is the next machine to
receive data item Rq[i] (where 0 ≤ i < 2p) on a path from
M[s] (where (s, l) = π[i, r]), which can be any machine
already holding a copy of Rq[i], to one or more requesting
destination machines. That is, machine M[s] holds a copy
of data item Rq[i], and M[r] must be the next machine to
receive Rq[i] so that M[Request[i, k]] (for one or more
values of k, where 0 ≤ k < Nrq[i]) can ultimately receive
Rq[i]. Let the set of values of k that satisfy this condition
(i.e., destination machines that request Rq[i] through M[r])
be called Drq[i, r].

Assume that Rq[i] is the next data item to be
allocated network resources. Let the value Sat[i, k] (where
0 ≤ i < 2p and 0 ≤ k < Nrq[i]) be 1 if Request[i, k]
would be satisfiable, and 0 if it would not be satisfiable.
For the simulations of Section 8, Sat[i, k] is 0 for
values of i such that p ≤ i < 2p, thus ignoring the less
desirable data item versions. Now, the effective priority
Efp[i, k] of data item Rq[i] at the kth requesting
location can be defined as Sat[i, k] * W[Priority[i, k]] *
Worth[i, k]. An urgency term, indicating how close a data
item’s available time is to its deadline time (in seconds) at
a destination, is defined as Urgency[i, k] = -Sat[i, k] *
(Rft[i, k] - A_T[i, Request[i, k]] + 1). A smaller
urgency here indicates that it is less urgent to get Rq[i] to
M[Request[i, k]]. The “+1” in the urgency term is so that
the urgency never becomes a small number close to zero.
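
The two terms can be checked with a few lines of Python. The function names and the example weights and times below are assumptions chosen for illustration; the arithmetic follows the definitions of the effective priority and the urgency given above.

```python
def effective_priority(sat, weight, priority, worth):
    """Sat * W[Priority] * Worth for one request."""
    return sat * weight[priority] * worth


def urgency(sat, deadline, available):
    """-Sat * (deadline - available time + 1) for one request."""
    return -sat * (deadline - available + 1)


weight = {0: 1.0, 1: 10.0, 2: 100.0}  # assumed priority weights

efp = effective_priority(1, weight, 2, 1.0)
relaxed = urgency(1, deadline=120, available=60)  # 60 seconds of slack
tight = urgency(1, deadline=61, available=60)     # 1 second of slack
```

The request with less slack has the larger (less negative) urgency. For any satisfiable request the deadline is not earlier than the available time, so the magnitude of the urgency is at least 1.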

The next value that must be defined before detailing the
cost criteria is the number of virtual links used to get from a
machine M[s] to a destination machine M[Request[i, k]],
where k ∈ Drq[i, r]. Let this value be called Nlinks[i, k],
and note that it reflects the number of links used in the path
(generated by the most recent run of Dijkstra’s algorithm)
from a machine holding the data item to a machine
requesting the data item.

All  of  the  following  cost  functions  take  into  account  the 
priority  and  urgency  of  a  data  item.  For  all  cost  criteria,  a 
smaller  value  indicates  a  more  desirable  use  of  communica¬ 
tion  resources;  therefore,  resource  allocation  is  performed 
by  the  procedures  in  Section  6  for  the  data  item  and  desti¬ 
nation  machine(s)  with  minimum  cost. 

Six of the costs allow the weight assigned to the priority
term to be varied relative to the weight assigned to the
urgency term. These weighting terms are W_E for the weight
of the effective priority term, and W_U for the weight of the
urgency term. The relative weight of these two terms
compared to each other (W_E/W_U) is called the E-U ratio.

5.2. Costs C1, C2, and C3

Four cost criteria were developed in the previous
research that combine the above effective priority and urgency
terms. The best performing cost, C4, was the basis for the
work presented in this paper. The definitions of C1, C2,
and C3 are not discussed in detail in this paper. The reader
is referred to [20] for more information about these three;
C4 will be discussed in more detail below. The
mathematical definitions of these three cost criteria are included for
reference. Each cost is for sending data item Rq[i] (where
0 ≤ i < 2p) to M[r] (where 0 ≤ r < m) from M[s] via
link L[s, r][k] (where (s, k) = π[i, r]), in order to
ultimately try to satisfy the jth (where 0 ≤ j < Nrq[i]) requesting
destination machine:

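\[ \mathit{C1}[i,j][s,r][k] = -W_E * \mathit{Efp}[i,j] - W_U * \mathit{Urgency}[i,j] \]

\[ \mathit{C2}[i][s,r][k] = \Big( -W_E * \sum_{j \in \mathit{Drq}[i,r]} \mathit{Efp}[i,j] \Big) + \Big( -W_U * \max_{j \in \mathit{Drq}[i,r]} \mathit{Urgency}[i,j] \Big) \]

\[ \mathit{C3}[i][s,r][k] = \sum_{j \in \mathit{Drq}[i,r]} \frac{\mathit{Efp}[i,j]}{\mathit{Urgency}[i,j]} \]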

5.3.  Cost  C4 

The cost C4 for transferring data item Rq[i] (where
0 ≤ i < 2p) to M[r] (where 0 ≤ r < m) from M[s] via
link L[s, r][k] (where (s, k) = π[i, r]), in order to
ultimately try to satisfy the jth (where j ∈ Drq[i, r]) requesting
destination machine(s), is:

\[ \mathit{C4}[i][s,r][k] = -W_E * \Big( \sum_{j \in \mathit{Drq}[i,r]} \mathit{Efp}[i,j] \Big) - W_U * \Big( \sum_{j \in \mathit{Drq}[i,r]} \mathit{Urgency}[i,j] \Big). \]

This cost sums the weighted priorities of all satisfiable
requests for data item Rq[i] on a path through machine
M[r] and combines that with the sum of the urgency for
those same satisfiable requests.

5.4. Cost C4links

Based on C4 because of its high performance in
simulation tests, cost C4links is also defined for transferring data
item Rq[i] (where 0 ≤ i < 2p) to M[r] (where 0 ≤ r < m)
from M[s] via link L[s, r][k] (where (s, k) = π[i, r]), in
order to ultimately try to satisfy the jth (where j ∈ Drq[i, r])
requesting destination machine(s):

\[ \mathit{C4links}[i][s,r][k] = -W_E * \Big( \sum_{j \in \mathit{Drq}[i,r]} \frac{\mathit{Efp}[i,j]}{\mathit{Nlinks}[i,j]} \Big) - W_U * \Big( \sum_{j \in \mathit{Drq}[i,r]} \mathit{Urgency}[i,j] \Big). \]

Because, for example, a data request that can be satisfied by
using three virtual links is using three times as much
network resources as a data request of the same size that can
be satisfied by using only one virtual link, this cost divides
the effective priority term for each requesting destination
by the number of links used to get to that destination. If the
effective priority associated with a data request is
considered as a measure of worth or importance to the user, then
this first term would be considered a measure of worth per
link. This should allow the cost criterion to better select
data items to satisfy that will make the most effective use of
the network resources available.

5.5. Cost C4size

Based again on C4 because of positive simulation
results, the criterion C4size is also defined for transferring
data item Rq[i] (where 0 ≤ i < 2p) to M[r] (where
0 ≤ r < m) from M[s] via link L[s, r][k] (where (s, k) =
π[i, r]), in order to ultimately try to satisfy the jth (where
j ∈ Drq[i, r]) requesting destination machine(s):

\[ \mathit{C4size}[i][s,r][k] = -W_E * \Big( \sum_{j \in \mathit{Drq}[i,r]} \frac{\mathit{Efp}[i,j]}{|\mathit{Rq}[i]|} \Big) - W_U * \Big( \sum_{j \in \mathit{Drq}[i,r]} \mathit{Urgency}[i,j] \Big). \]

A data request with an effective priority p representing its
worth to the recipient, and a size in bytes of q, then has an
effective worth per byte of p/q. Because the goal of a cost
criterion is to identify data requests that will make the most
effective use of network resources, the first term in C4size
uses this effective priority divided by data request size to
find data items that will transmit the maximum amount of
worth per bandwidth byte.

5.6. Cost C4sizlnk

Cost C4sizlnk is a combination of the ideas in C4size
and C4links, and gives a cost for transferring data item
Rq[i] (where 0 ≤ i < 2p) to M[r] (where 0 ≤ r < m) from
M[s] via link L[s, r][k] (where (s, k) = π[i, r]), in order
to ultimately try to satisfy the jth (where j ∈ Drq[i, r])
requesting destination machine(s):

\[ \mathit{C4sizlnk}[i][s,r][k] = -W_E * \Big( \sum_{j \in \mathit{Drq}[i,r]} \frac{\mathit{Efp}[i,j]}{|\mathit{Rq}[i]| * \mathit{Nlinks}[i,j]} \Big) - W_U * \Big( \sum_{j \in \mathit{Drq}[i,r]} \mathit{Urgency}[i,j] \Big). \]

By combining the size and number of virtual links used,
this cost gives a more accurate calculation of the resources
used by a data request. For instance, consider two data
items Rq[i1] and Rq[i2] of equal priority. Consider also
that Rq[i2] is twice as large as Rq[i1], and that it requires
the use of three virtual links versus Rq[i1]’s single
virtual link. In this case, Rq[i2] is requiring six times the total
network resources required by Rq[i1] in order to satisfy the
same priority level of request.
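
The four C4-based costs differ only in how the effective priority of each destination is scaled, which the following Python sketch makes explicit. The function name c4_family, the dictionary arguments, and the numbers in the example are assumptions for illustration; the urgency weight is set to zero in the example only to isolate the priority term.

```python
def c4_family(efp, urg, nlinks, size, w_e, w_u):
    """efp, urg, nlinks: dicts keyed by destination index j; size: item size."""
    urgency_term = w_u * sum(urg.values())
    return {
        "C4": -w_e * sum(efp.values()) - urgency_term,
        "C4links": -w_e * sum(efp[j] / nlinks[j] for j in efp) - urgency_term,
        "C4size": -w_e * sum(efp[j] / size for j in efp) - urgency_term,
        "C4sizlnk": -w_e * sum(efp[j] / (size * nlinks[j]) for j in efp)
        - urgency_term,
    }


# Two items of equal effective priority, as in the text: the second is
# twice as large and needs three virtual links instead of one.
small = c4_family({0: 12.0}, {0: -5.0}, {0: 1}, size=1.0, w_e=1.0, w_u=0.0)
large = c4_family({0: 12.0}, {0: -5.0}, {0: 3}, size=2.0, w_e=1.0, w_u=0.0)
```

In this example C4 rates both items equally, while C4sizlnk gives the smaller, single-link item a cost six times more negative, mirroring the factor of six in resource use described above.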

6.  Resource  Allocation  Procedures 

6.1.  Introduction 

The  three  procedures  below  allocate  varying  amounts  of 
network  resources  for  a  single  data  item  after  each  run  of 
Dijkstra’s  algorithm,  based  on  a  cost  function  from  Section 
5.  The  performance  of  these  procedures  is  shown  in  Section 
8. 

The  resource  allocations  performed  by  these  proce¬ 
dures  update  the  following  information  in  the  system  after 
scheduling  Rq[i]  to  move,  and  before  running  Dijkstra’s  al¬ 
gorithm  again:  (1)  the  list  of  virtual  links  and  their  start 
and stop times, (2) the available memory capacity on any
machines on which data item Rq[i] has been placed, (3) the list
of  machines  on  which  Rq[i]  is  available,  and  (4)  the  time 
at  which  Rq[i]  can  be  removed  from  any  intermediate  ma¬ 
chines. 
6.2. Partial Path Procedure

Each  iteration  of  this  procedure  involves:  (1)  perform¬ 
ing  Dijkstra’s  algorithm  for  each  data  request  individually; 
(2)  for  the  valid  next  communication  steps,  determining  the 
“cost”  to  transfer  a  data  item  to  its  successor  in  the  shortest 
path;  (3)  picking  the  lowest  cost  data  request  and  transfer¬ 
ring  that  data  item  to  the  successor  machine  (making  this 
machine  an  additional  source  of  that  data  item);  (4)  updat¬ 
ing  system  parameters  to  reflect  resources  used  in  (3);  and 
(5) repeating (1) through (4) until there are no more
satisfiable requests in the system. In some cases, Dijkstra’s
algorithm  would  not  need  to  be  executed  each  iteration  for 
a  particular  data  transfer,  i.e.,  if  the  data  transfer  did  not  use 
resources  needed  for  any  future  transfers.  In  this  study,  only 
one  data  item  is  scheduled  before  rerunning  Dijkstra’s  algo¬ 
rithm  (this  applies  for  all  three  procedures).  This  simplified 
the  implementation  of  the  procedures  without  changing  the 
performance  of  the  resulting  schedules.  The  execution  time 
of  the  procedures  is  affected;  however,  minimizing  this  is 
not  the  main  goal  of  the  work. 
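
The iteration described above has the shape of a greedy loop, sketched below in Python. The function greedy_schedule and its three callback parameters are hypothetical names introduced for this sketch. The toy example collapses each request into a single hop that consumes one named resource, which is far simpler than the network model, and is meant only to show how one allocation can make another request unsatisfiable.

```python
def greedy_schedule(state, candidates, cost, apply_step):
    """Repeat: list valid next steps, pick the cheapest, apply it, update state."""
    schedule = []
    while True:
        steps = candidates(state)  # stands in for rerunning the shortest-path search
        if not steps:
            return schedule        # no satisfiable requests remain
        best = min(steps, key=lambda step: cost(state, step))
        apply_step(state, best)    # consume the resources used by this step
        schedule.append(best)


# Toy data (assumed): request name -> (resource needed, cost of the step).
requests = {"x": ("a", -5.0), "y": ("a", -9.0), "z": ("b", -1.0)}
state = {"free": {"a", "b"}, "pending": set(requests)}


def candidates(state):
    return sorted(n for n in state["pending"] if requests[n][0] in state["free"])


def cost(state, name):
    return requests[name][1]


def apply_step(state, name):
    state["free"].discard(requests[name][0])
    state["pending"].discard(name)


schedule = greedy_schedule(state, candidates, cost, apply_step)
```

Here request y is chosen first because it has the lowest cost, which uses up resource a and leaves request x unsatisfiable; request z is then scheduled and the loop stops.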

This  procedure  will  schedule  the  transfer  for  the  sin¬ 
gle  “most  important”  request  that  must  be  transferred  next, 
based  on  a  cost  criterion.  The  procedure  (first  described  in 
[19])  is  called  the  partial  path  procedure  because  only  one 
successor  machine  in  the  path  is  scheduled  at  each  itera¬ 
tion.  If  a  data  item  is  partially  scheduled  through  the  sys¬ 
tem  and  because  of  other  scheduled  transfers  the  request¬ 
ing  destination’s  deadline  is  no  longer  satisfied,  the  sched¬ 
uled  transfers  remain  in  the  system  (the  initial  transfers  were 
scheduled  because  the  deadline  could  have  been  satisfied). 
Reasons  the  schedule  for  this  now  unsatisfiable  request  is 
not  removed  include:  (1)  in  a  dynamic  situation,  a  change 
in  the  network  could  allow  the  request  to  be  satisfied;  and 
(2)  removing  the  already  scheduled  transfers  would  require 
restarting  the  scheduling  for  all  data  requests  because  of 
conflicts  that  might  have  occurred. 

6.3.  Full  Path/One  Destination  Procedure 

The full path/one destination procedure uses a cost
criterion to select a data request at an individual destination
machine for resource allocation. The data item is then sent
from its current location (machine M[s] in each of the cost
criteria) over as many virtual links as required to reach its
destination machine (machine M[j] for one value of j). For
costs C4, C4links, C4size, and C4sizlnk, the data item
Rq[i] with minimum cost is sent first to machine M[r], and
if no request was satisfied, the cost is applied a second time
for the same data item Rq[i], but setting the new M[s]
(data source machine) to the old M[r] (the machine to which
the data was just scheduled). The minimum cost is then
taken over all values of r (possible next storage locations). The
value of r with minimum cost determines the machine M[r]
that the data is sent to next. This process continues until the
data item has reached one requesting destination M[j].


This  produces  a  communication  schedule  using  fewer 
executions  of  Dijkstra’s  algorithm  than  the  partial  path  pro¬ 
cedure.  The  behavior  of  the  partial  path  procedure  showed 
that  if  a  data  item  Rq[i]  was  selected  for  scheduling  a  trans¬ 
fer  to  its  next  intermediate  location  (a  “hop”),  in  the  follow¬ 
ing  iteration,  the  same  requested  data  item,  Rq[i],  would 
typically  be  selected  again  to  schedule  its  next  hop.  The 
full  path/one  destination  procedure  attempts  to  exploit  this 
trend  by  selecting  a  requested  data  item  with  a  cost  crite¬ 
rion  and  scheduling  all  hops  required  for  the  data  item  to 
reach  its  lowest  cost  destination  before  executing  Dijkstra’s 
algorithm  again. 
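
One way to see which hops a full path involves is to walk the predecessor values produced by the shortest-path search back from the destination. The sketch below assumes the predecessors are stored in a Python list of (sender, link index) pairs, with (-1, -1) marking a machine that has no sender; the function name hops_to is hypothetical. It illustrates the bookkeeping only and does not perform the cost-driven selection of each next machine described above.

```python
def hops_to(destination, pred):
    """Return the (sender, link index, receiver) hops that end at destination."""
    hops = []
    node = destination
    while pred[node] != (-1, -1):
        sender, k = pred[node]
        hops.append((sender, k, node))
        node = sender
    hops.reverse()  # order the hops from the holding machine to the destination
    return hops


# Assumed predecessors: machine 0 holds the item, machine 1 receives it from
# machine 0, machine 2 receives it from machine 1, machine 3 is unreachable.
pred = [(-1, -1), (0, 0), (1, 0), (-1, -1)]
path = hops_to(2, pred)
number_of_links = len(path)
```

The length of the returned list corresponds to the number of virtual links on the path. An empty list is returned both for a machine that already holds the item and for an unreachable machine, so the available time must be consulted to tell the two cases apart.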

The  partial  path  procedure  may  construct  a  partial  path 
(of  many  links)  that  it  later  cannot  complete  (due  to  net¬ 
work  or  memory  resources  being  consumed  by  other  re¬ 
quested  data  items).  However,  until  this  is  determined,  the 
part  of  the  path  constructed  may  block  the  paths  of  the  oth¬ 
er  requested  data  items,  causing  them  to  take  less  optimal 
paths  or  causing  them  to  be  deemed  unsatisfiable.  The  full 
path/one  destination  procedure  avoids  this  problem.  An  ad¬ 
vantage  the  partial  path  approach  does  have  over  the  full 
path/one  destination  approach  is  that  it  allows  the  link-by¬ 
link  assignment  of  each  virtual  link  and  each  machine’s 
memory  capacity  to  be  made  based  on  the  relative  values 
of  the  cost  criteria  for  the  data  items  that  may  want  the  re¬ 
source. 

6.4.  Full  Path/All  Destinations  Procedure 

The  full  path/all  destinations  procedure  resembles  the 
full  path/one  destination  procedure  but  allocates  more  net¬ 
work  resources  after  each  run  of  Dijkstra’s  algorithm.  This 
procedure  satisfies  all  requests  that  would  benefit  from 
sending data item Rq[i] from machine M[s] to M[r] (i.e.,
those in the set Drq[i,r]). Cost C1 is not used in conjunction
with this procedure because it examines the cost of only
one  destination  at  a  time.  This  approach  was  considered 
because  it  was  expected  to  generate  results  comparable  to 
the  full  path/one  destination  procedure,  but  with  a  smaller 
procedure  execution  time. 

7.  Upper  and  Lower  Bounds 
7.1.  Introduction 

Finding an optimal solution to a data staging task with realistic
parameter values is an intractable problem. Therefore,
it  is  currently  impractical  to  directly  compare  the  quality  of 
the  solutions  found  by  the  proposed  heuristics  with  those 
found  by  exhaustive  searches  in  which  optimal  answers  can 
be  obtained  by  enumerating  all  the  possible  schedules  of 
communication  steps.  Also,  to  the  best  of  the  author’s 
knowledge,  there  is  no  other  work  presented  in  the  open  lit¬ 
erature  that  addresses  the  data  staging  problem  and  presents 
a  heuristic  for  solving  it  (based  on  a  similar  underlying 
model).  Thus,  there  is  no  other  heuristic  for  solving  the 
same  problem  with  which  to  make  a  direct  comparison  of 
the heuristics presented in this document. To aid in the
evaluation  of  these  heuristics,  a  lower  bound  and  an  upper 
bound  on  the  performance  of  the  heuristics  are  provided. 

7.2.  Full  Path  Random  Dijkstra 

The lower bound, called the full path random Dijkstra
method, takes into account which data requests are satisfiable
when it allocates resources, allowing it to improve
over the random Dijkstra method used in [20]. It allocates
enough resources in one scheduling step to take a data item
from its current location all the way to one random satisfiable
requesting destination before running Dijkstra’s algorithm
again. This method is based on the full path/one
destination  procedure  except  that  the  next  chosen  transfer 
is  randomly  selected,  instead  of  using  a  cost  function.  This 
bound  differs  from  the  single  Dijkstra  random  method  of 
[20]  in  that  (1)  this  method  checks  that  a  requesting  desti¬ 
nation  is  satisfiable  before  allocating  any  resources  toward 
fulfilling  it,  and  (2)  Dijkstra’s  algorithm  is  run  with  updated 
communication  system  information  after  each  scheduling 
step. 

7.3.  Possible  Satisfy  Bandwidth 

The  possible  satisfy  bandwidth  bound  is  a  tighter  bound 
than the possible satisfy bound of [20]. It considers satisfiable
requests, and also the total amount of bandwidth
available in the system, NetBandwidth. This value is calculated
by adding together the number of bytes that could
be transmitted over each virtual link in the system during
the entire time interval being simulated. Consider the set
of  requests  that  would  be  satisfiable  if  each  was  the  only  re¬ 
quest  in  the  system.  Then  the  one  that  has  the  largest  ratio 
of  priority  weight  to  data  item  size  is  selected.  Selecting 
the  request  that  satisfies  this  condition  guarantees  that  if  a 
single  link  is  used  to  satisfy  this  request,  it  will  give  the 
highest  possible  priority  weight  value  per  byte  of  network 
bandwidth  used  as  compared  to  all  requests  remaining  in 
the  system.  Each  time  a  request  is  found,  its  size  in  bytes 
is  added  to  the  bandwidth  used  in  the  system  (this  assumes 
that  only  one  virtual  link  is  needed  to  satisfy  this  request) 
and  its  weighted  priority  is  added  to  the  weights  of  the  oth¬ 
er  data  items  that  have  been  selected.  That  particular  re¬ 
quest  is  then  removed  so  that  a  new  request  can  be  found. 
This continues until the sum of bandwidths for the accepted
requests exceeds NetBandwidth. This upper bound is
unrealistic,  however,  because  it  does  not  take  into  account 
that  more  than  one  link  may  have  to  be  used  to  satisfy  a  re¬ 
quest,  nor  does  it  consider  the  time  intervals  that  links  are 
available,  nor  does  it  consider  what  machines  have  network 
bandwidth  available  between  them. 
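
The selection loop of this bound can be sketched as follows. This is a minimal illustration under stated assumptions, not the simulator used in the study: the function name, the (weight, size) tuple layout, and the choice to stop only once the running total has passed NetBandwidth are assumptions made here.

```python
def possible_satisfy_bandwidth_bound(requests, net_bandwidth):
    """Greedy upper bound on the weighted sum of satisfied requests.

    requests: iterable of (priority_weight, size_bytes) pairs, assumed to be
              already filtered to individually satisfiable requests.
    net_bandwidth: total bytes all virtual links could carry (NetBandwidth).
    """
    # Highest priority weight per byte of bandwidth first.
    ordered = sorted(requests, key=lambda r: r[0] / r[1], reverse=True)
    used_bytes = 0
    weight_sum = 0.0
    for weight, size in ordered:
        if used_bytes > net_bandwidth:
            break  # accepted bandwidth has exceeded NetBandwidth
        used_bytes += size   # assumes one virtual link per request
        weight_sum += weight
    return weight_sum


bound_tight = possible_satisfy_bandwidth_bound([(16, 4), (4, 4), (1, 4)], 6)
bound_loose = possible_satisfy_bandwidth_bound([(16, 4), (4, 4), (1, 4)], 100)
```

Because every request is charged for a single link only, the value returned can never be smaller than what a real schedule achieves, which is what makes it usable as an upper bound.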

8.  Extended  Simulation  Study 
8.1.  Introduction 

After the simulation study of [19] was completed, a
new study was designed to examine the effects of varying
some other parameters within the system. In particular, this
new study introduces three new cost criteria and two new
bounds, and it varies additional simulation parameters, including
eight network loadings, three average numbers of
links used to get from a source machine to a destination
machine, and five priority weighting schemes.

The results of [20] indicated that C4 was the best-performing
cost criterion. This led to the development of
cost criteria C4size, C4links, and C4sizlnk, described in
Section 5, for the new study. Because of the previous good
performance of the full path/one destination procedure, it
was implemented for the new study with all seven cost criteria
described in Section 5. For comparison, the other two
procedures in Section 6 (partial path and full path/all destinations)
were also implemented for the new study with cost
C4, for a total of nine heuristics.

In  the  previous  study,  all  requests  averaged  traversing 
approximately  1.5  communication  links  (a  communication 
link  traversal  count)  from  an  initial  source  machine  to  a 
requesting  destination  machine.  It  was  decided  that  the 
requests  would  be  generated  in  a  manner  allowing  this  pa¬ 
rameter  to  be  controlled  and  varied  with  three  different  val¬ 
ues  in  the  new  study.  Another  parameter  concerning  the 
data  requests  was  the  number  of  requests  being  made  ver¬ 
sus  the  number  of  requests  that  the  network  could  possibly 
fulfill.  Eight  different  “network  loads”  were  decided  upon 
for  the  new  simulation  study,  as  opposed  to  only  one  in  the 
earlier  [20]  study.  In  combination  with  the  three  communi¬ 
cation  link  traversal  counts,  there  was  a  total  of  24  different 
data  request  scenarios. 

For  this  study,  it  was  decided  that  a  six-level  priority 
scheme  would  be  used  in  place  of  the  three-level  method 
used  in  the  previous  study.  This  was  intended  to  better 
reflect the priority classes present in a military environment.
Level 0 was generated with a 50% probability, level
1 with 25%, level 2 with 12%, level 3 with 7%, level 4
with 4%, and level 5 with 2%.
These  percentages  were  selected  to  reflect  the  fact  that  in  a 
BADD/AICE-like  environment,  there  would  likely  only  be 
a  small  number  of  data  requests  in  the  highest  priority  class, 
and  a  large  number  of  data  requests  at  the  lowest  priority 
class. 

The  weighting  of  the  priority  levels  was  changed  to  a 
system  where  the  weight  of  each  priority  level  was  a  fixed 
multiple  of  the  weight  of  the  priority  level  immediately  be¬ 
low  it.  Five  different  values  for  this  multiple  were  used  for 
this  study,  and  each  was  evaluated  with  each  of  the  24  data 
request  scenarios  above,  resulting  in  120  testing  scenarios 
for  evaluation  by  the  79  heuristic/E-U  ratio  combinations. 

As in the previous study, 40 individual test cases
(each with a unique network configuration and set of data
requests) were generated for each testing scenario, because
a single case cannot reflect the range of possible data
requests and network configurations. This resulted in the
379,200 simulation runs described in this section.

Table 1. Network parameters used for the generation of test cases.

parameter                 min. value      max. value
# machines                14              16
# srcs per data item      1               3
# dests per data item     1               5
src available time        1 sec           3600 sec
dest deadline delay       900 sec         3600 sec
data item size            10 kBytes       100 MBytes
machine storage           10 MBytes       20 GBytes
# outbound links          1               4
link bw                   10 kBits/sec    1.5 MBits/sec
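
The run count quoted in this section follows directly from the parameter counts given in the text. The short check below only multiplies those stated counts; the variable names are chosen here for illustration.

```python
network_loads = 8               # oversubscription rates
link_traversal_counts = 3       # average links from source to destination
priority_weightings = 5         # values of the priority weight multiple
heuristic_eu_combinations = 79  # heuristic/E-U ratio combinations
cases_per_scenario = 40         # test cases per testing scenario

request_scenarios = network_loads * link_traversal_counts
testing_scenarios = request_scenarios * priority_weightings
total_runs = testing_scenarios * heuristic_eu_combinations * cases_per_scenario
print(request_scenarios, testing_scenarios, total_runs)
```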


8.2.  Generation  of  Test  Cases 

The  network  parameters  used  to  create  data  sets  for  this 
simulation  study  are  summarized  in  Table  1.  Actual  val¬ 
ues  were  generated  randomly  with  uniform  probability  be¬ 
tween  (and  including)  the  minimum  and  the  maximum  val¬ 
ues  shown  in  the  table.  The  “src  available  time”  is  the  time 
the  data  item  is  available  at  all  of  its  sources  (the  same  time 
for  all  sources  of  that  data  item).  The  “dest  deadline  de¬ 
lay”  is  the  deadline  for  the  requested  data  item  relative  to 
the  time  it  becomes  available  at  its  sources.  These  parame¬ 
ter  values  are  intended  to  be  representative  of  a  subset  of 
a  BADD/AICE-like  environment.  For  more  information 
about  how  these  parameters  are  used  to  generate  the  test 
cases,  the  reader  is  referred  to  [6]. 

For  this  simulation  study,  the  number  of  data  items  gen¬ 
erated  for  a  network  was  700  times  the  number  of  machines 
in  the  network.  After  all  items  were  generated,  Dijkstra’s 
algorithm  was  run  once  for  each  item,  establishing  the  indi¬ 
vidual  satisfiability  of  each  data  item  at  each  requesting  des¬ 
tination  along  with  a  path  of  communication  links  used  to 
reach  each  destination.  The  average  number  of  communica¬ 
tion  links  traversed  from  a  source  machine  to  a  destination 
machine  for  all  of  the  satisfiable  requests  is  the  “resulting 
average  communication  link  traversal  count.”  As  indicated 
above,  three  different  average  link  counts  were  generated 
(1.5, 2.5,  and  3.5),  and  for  each  count,  40  different  networks 
and  associated  data  requests  were  created  with  the  method 
given  above,  resulting  in  a  total  of  120  networks  with  asso¬ 
ciated  data  requests. 

Now  consider  in  the  network  all  data  requests  that  are 
determined to be satisfiable individually according to the first
execution of Dijkstra’s algorithm. When considering each
of these requests as if it were the only data request in the
system, the resulting virtual link path from Dijkstra’s algorithm
and other known information can be used to calculate
the bytes of bandwidth needed for each request. Then these
bandwidths can be summed to give a value representing the
total number of bytes of data bandwidth being requested in
the system. Call this value ReqBandwidth. Recall now
the value NetBandwidth calculated by summing together
the total number of bytes that could be transmitted on each
of the virtual links within the network during the simulation
period. An oversubscription rate can then be defined
as ReqBandwidth/NetBandwidth. If this term is larger
than  1,  the  network  can  clearly  not  satisfy  all  requests  due  to 
bandwidth  limitations.  If  the  term  is  less  than  1,  bandwidth 
may  not  exist  between  the  correct  machines  or  may  not  be 
available  during  the  required  time  to  satisfy  all  requests. 

To  examine  system  performance  under  various  request 
loads,  it  was  decided  to  consider  networks  with  the  follow¬ 
ing  oversubscription  rates:  25.0,  12.5,  6.2,  3.1,  1.6,  0.8, 
0.4,  and  0.2.  These  desired  data  sets  were  created  by  start¬ 
ing  with  one  of  the  networks  and  its  associated  set  of  data 
requests,  and  removing  random  data  requests  until  the  de¬ 
sired  oversubscription  rate  was  achieved.  This  did  not  sig¬ 
nificantly  affect  the  average  communication  link  traversal 
counts.  It  resulted  in  data  sets  consisting  of  the  same  net¬ 
work  with  eight  different  oversubscription  rates,  for  each  of 
the  120  networks. 
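
The oversubscription rate and the thinning step described above can be sketched as follows. This is an illustration under assumptions, not the generator of [6]: the bandwidth needed by a request is taken here to be its size multiplied by the number of virtual links on its path, and the function names and tuple layout are hypothetical.

```python
import random


def oversubscription_rate(requests, link_capacities):
    """ReqBandwidth / NetBandwidth.

    requests: list of (size_bytes, links_on_path) pairs.
    link_capacities: bytes each virtual link can carry in the simulated period.
    """
    req_bandwidth = sum(size * hops for size, hops in requests)
    net_bandwidth = sum(link_capacities)
    return req_bandwidth / net_bandwidth


def thin_requests(requests, link_capacities, target_rate, rng):
    """Remove random requests until the rate is at or below target_rate."""
    kept = list(requests)
    rng.shuffle(kept)  # popping from a shuffled list removes random requests
    while kept and oversubscription_rate(kept, link_capacities) > target_rate:
        kept.pop()
    return kept


all_requests = [(100, 2)] * 50   # 50 requests of 100 bytes over 2 links each
capacities = [400]               # one virtual link carrying 400 bytes
initial_rate = oversubscription_rate(all_requests, capacities)
kept = thin_requests(all_requests, capacities, 6.2, random.Random(0))
```

Recomputing the rate after every removal keeps the sketch short; a running total would be the natural choice for data sets of the size used in the study.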

When applying the heuristics to these test cases, a variety
of E-U ratios were used. For simulations run using the full
path/one destination procedure with C4size and C4sizlnk,
the values of $\log_{10}(W_E/W_U)$ used were inf, 9, 8, 7, 6, 5, 4, and -inf.
The values of inf and -inf represent considering only
the priority term (the term weighted by $W_E$) and
only the urgency term (the term weighted by $W_U$), respectively.
For simulations run using the partial path
procedure with C4, the full path/all destinations procedure
with C4, and the full path/one destination procedure
with C4 and C4links, the values of $\log_{10}(W_E/W_U)$ used were
inf, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, and -inf.

The last parameter that was varied in this simulation
study was the relative weight of one level of priority compared
to another. With the six priority levels of data
requests, the approach simulated was to make the weight
of a priority level $a$ data request (where $0 \le a \le 5$) be
$\omega^a$ (i.e., $W[a] = \omega^a$) for some fixed value of $\omega$. The values of
$\omega$ simulated were 1, 2, 4, 8, and 16, and this was done for
each of the networks and loadings mentioned above. The
results of the simulations using these parameters are now
presented.
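
As a small illustration of the priority scheme, the sketch below draws priority levels with the probabilities stated in Section 8.1 and computes the weight of a level as the fixed multiple raised to the level number, consistent with each level weighing a fixed multiple of the level below it. The names are hypothetical and the sampling routine is an assumption, not the generator used in the study.

```python
import random

# Probabilities for priority levels 0 through 5, as stated in Section 8.1.
LEVEL_PROBABILITIES = [0.50, 0.25, 0.12, 0.07, 0.04, 0.02]


def priority_weight(level, omega):
    """Weight of a request at the given level for multiple omega."""
    return omega ** level


def sample_priority_level(rng):
    """Draw one priority level according to LEVEL_PROBABILITIES."""
    return rng.choices(range(6), weights=LEVEL_PROBABILITIES, k=1)[0]


weights_for_omega_4 = [priority_weight(a, 4) for a in range(6)]
sampled_level = sample_priority_level(random.Random(1))
```

With a multiple of 1 every level has weight 1, so the weighted sum reduces to a plain count of satisfied requests.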

8.3.  Evaluation  of  Simulations 

Heuristic  and  bound  labels  used  in  the  graphs  at  the  end 
of  this  subsection  are  summarized  in  Table  2.  As  stated 
previously,  because  of  the  good  performance  of  the  full 
path/one  destination  procedure  and  Cost  C4,  they  were  the 
focus  of  these  new  experiments.  The  three  new  costs  taking 
into  account  data  item  size  and  the  number  of  communica¬ 
tion  links  traversed  from  a  source  to  a  destination  are  shown 
in Figure 1. The peak performance of the costs taking data
item size into account is further to the right (signifying
higher E-U ratios) in the graph because those costs divide
the effective priority term by the data item size. Due to
space constraints, only a subset of the results from [6] appears
in this paper.

In Figures 2 through 5, the data points for the heuristics
used correspond to the best E-U ratio for each testing scenario
(for $\omega \neq 1$, this is a combination of the priority and
urgency terms). The values for the normalized vertical axis
in all of these graphs are computed as follows. For each test
case, the sum of the satisfied requests’ weighted priorities
for a given heuristic or bound is divided by the sum of satisfied
requests’ weighted priorities given by the best E-U ratio
for full_one_C4. This normalized sum is then averaged over
the 40 network test cases to give the final value for each data
point.
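
The normalization just described amounts to a per-test-case ratio followed by an average. A minimal sketch, with a hypothetical function name and made-up input values:

```python
def normalized_performance(heuristic_sums, baseline_sums):
    """Average over test cases of (heuristic sum / baseline sum).

    heuristic_sums: weighted priority sums for one heuristic or bound,
                    one entry per test case.
    baseline_sums:  sums for full_one_C4 at its best E-U ratio,
                    in the same test case order.
    """
    ratios = [h / b for h, b in zip(heuristic_sums, baseline_sums)]
    return sum(ratios) / len(ratios)


example = normalized_performance([50.0, 90.0], [100.0, 100.0])
baseline = normalized_performance([100.0, 80.0], [100.0, 80.0])
```

Dividing within each test case before averaging keeps a test case with large raw sums from dominating the plotted value.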

The relative performance of the heuristics is shown in
Figures  2  and  3.  The  costs  considering  data  item  size  will 
tend  to  allocate  resources  for  all  of  the  smaller  data  items 
first,  resulting  in  many  small  time  intervals  of  link  band¬ 
width  being  allocated  initially.  In  the  lightly  loaded  cases, 
the  remainder  of  the  link  bandwidth  must  be  used  by  larger 
data  items,  but  continuously  available  links  may  not  exist 
for  a  long  enough  period  of  time  for  these  larger  data  items 
to  use.  In  the  more  heavily  loaded  network  cases,  there  are 
enough  smaller  data  items  available  to  make  use  of  all  of 
the  network  bandwidth  without  sending  any  of  the  larger 
data  items.  The  resulting  trend  is  that  the  costs  incorporat¬ 
ing  data  item  size  have  a  relative  decrease  in  performance 
for  lightly  oversubscribed  networks,  followed  by  a  relative 
increase  in  performance  for  the  heavily  oversubscribed  net¬ 
works. 

There is a general overall trend that as $\omega$ increases (and
other factors are fixed), the performance levels of the heuristics
move closer to each other. This is because more of the total sum
of priority weights of requests in the system is contributed
by a few highest priority requests.

Table 2. Labels for heuristics and bounds
used in the graphs of Section 8.3.

heuristic combination               label used
partial path w/ C4                  partial_C4
full path/one dest. w/ C4           full_one_C4
full path/one dest. w/ C4links      full_one_C4links
full path/one dest. w/ C4size       full_one_C4size
full path/one dest. w/ C4sizlnk     full_one_C4sizlnk
full path/all dest. w/ C4           full_all_C4
possible satisfy bandwidth          possible_satisfy_bw
full path random Dijkstra           full_rand_Dijkstra

Figure 1. Sample graph of the effect of varying
the E-U ratio ($W_E/W_U$). The data sets used
had an average link traversal count of 2.5, a
request over-subscription rate of 3.1, and an
$\omega$ value of 4.

The method full_one_C4links performed very comparably
to full_one_C4 in all tests. There was no situation
indicated by these simulations where full_one_C4links
should be chosen over full_one_C4, or vice versa. The
partial_C4 method was also shown to perform comparably to
the full_one_C4 method in all cases.

The full_all_C4 method is shown to perform well for
small oversubscription rates, but as the oversubscription rate
increases, a clear decrease in performance is seen. This
is due to the full path/all destinations procedure allocating
resources for more than one destination simultaneously,
where some requesting destinations may have very low
priority.

Table 3 shows the average number of requests satisfied
at each priority level by full_one_C4 as compared to a simple
algorithm that schedules all requests of a higher priority
level before any requests of a lower priority level. In particular,
this algorithm was full_one_C1 with $W_U = 0$. For
$\omega > 1$ in these tables, more requests in the top three priority
levels are being satisfied by full_one_C4 (which obeys
the relative importance assigned to each of the priority levels
set by the policy maker) than the level by level method
(which ignores these policy requirements). The number of
satisfied requests at the top priority level remains comparable
for full_one_C4 and $\omega > 1$ because there are so few
requests at that level that all are able to be satisfied. This
is indicated by the fact that the level by level method cannot
satisfy any more of the top priority requests. For example,
even though the level by level method schedules all
priority level 5 requests as if they were the only requests
in the system, the total number scheduled does not exceed
the results of full_one_C4 (for $\omega > 1$). This shows that
full_one_C4, using urgency in addition to effective priority,
is better than full_one_C1 without urgency. Furthermore,
full_one_C4 results in a higher sum of weighted priorities of
satisfied requests than the level by level method in almost
all cases considered in Table 3.

Figure 2. Weighted sum of satisfied
requests’ priorities normalized at each
over-subscription rate to the performance of
full_one_C4. The data sets had an average
link traversal count of 2.5 and an $\omega$ value of 4.
(Horizontal axis: request oversubscription rate.)

Figure 3. Weighted sum of satisfied
requests’ priorities normalized at each
over-subscription rate to the performance of
full_one_C4. The data sets had an average
link traversal count of 2.5 and an $\omega$ value of 4.
(Horizontal axis: request oversubscription rate.)

In summary, a class of heuristics that compare well to
upper and lower bounds has been developed and analyzed.
Many heuristics perform within a few percentage points of
each other, and this is why it is important to also consider
the execution times of the different approaches. Furthermore,
while in general several heuristics perform comparably,
if a system is known to have a particular operating environment
(e.g., $\omega$ value, oversubscription rate), there may
be a preference for one heuristic over another. Confidence
intervals for some of the data points generated by test cases
in this section can be found in [6].

Table 3. Number of requests satisfied at each
priority level by full_one_C4 with an average
link traversal count of 2.5 and an oversubscription
rate of 1.6. The “level by level” column
shows the effect of allocating resources
for all priority class $\alpha$ requests before all priority
class $\beta$ requests where $\alpha > \beta$.

            number of requests satisfied
priority    $\omega$                                        level by
level       1        2        4        8        16       level
5           4.2      8.3      8.4      8.4      8.4      8.2
4           9.8      16.0     16.1     16.1     16.1     16.0
3           16.4     23.4     23.4     23.1     23.5     22.8
2           28.8     33.8     31.2     27.0     32.5     32.5
1           57.6     50.5     44.8     43.1     43.0     50.1
0           118.8    81.6     82.0     85.9     84.9     78.6


9.  Data  Items  With  Multiple  Versions 

9.1.  Approach 

In  this  section,  a  variable  time,  variable  accuracy  algo¬ 
rithm  will  be  presented  to  deal  with  data  items  with  a  higher 
quality  and  lower  quality  version,  as  mentioned  in  Section 
3.  The  higher  quality  data  item  is  assumed  for  simplicity  to 
be  twice  the  size  of  the  lower  quality  data  item.  The  higher 
quality  data  item,  however,  has  four  times  as  much  “worth” 
to  the  end  user  as  the  lower  quality  data  item.  This  worth 
was  chosen  to  indicate  that  the  system  should  be  penalized 
for  selecting  the  lower  quality  data  item  over  the  higher  one. 

The approach used to incorporate these lower quality data
item versions into the developed heuristics was to create
an iterative algorithm that attempts to create a new schedule
Sh with each iteration that has a smaller effect E[Sh].
In the first iteration, only the higher quality versions of the
data items are considered satisfiable by the value Sat[i, k]
(where $0 \le i < 2p$ and $0 \le k < Nrq[i]$). That is,
Sat[i, k] (from the cost criteria of Section 5) can only be
1 if $0 \le i < p$. A heuristic is then used with Dijkstra’s
algorithm to create a complete schedule of data transfers,
which corresponds to the research described in Section 8.

After the first iteration schedule has been determined, the
value of Sat[j, k] (where $0 \le j < p$) for the second iteration
is only allowed to be 1 if Request[j, k] was satisfied
in the previous iteration. The value of Sat[j + p, k] is then
only allowed to be 1 if Request[j, k] was not satisfied in
the previous iteration. A complete new schedule is created
using a heuristic with Dijkstra’s algorithm. That is, if during
iteration one a requesting destination does not receive
its higher quality requested data item, then in the second
iteration, it will request the lower quality version of that data
item instead. The schedule produced by the second iteration
will then likely satisfy at least a few lower quality data
item requests (of higher priority) in place of higher quality
data item requests (of lower priority). The higher quality
data item requests that are not satisfied in the second iteration
then request their respective lower quality versions for
the third iteration. This iterative process can be repeated
as many times as allotted execution time permits, and can
stop at any time after the first iteration and output the best
schedule that it has generated thus far. (This assumes that
the best schedule is kept separately after each iteration and
that the last iteration performance may not result in the best
schedule.)
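
The control flow of this iterative process can be sketched as follows. This is a simplified illustration, not the authors' implementation: `run_heuristic` and `effect` are hypothetical callbacks standing in for a heuristic of Section 6 and for E[Sh], and the toy heuristic and its capacity numbers are invented for the example (they mirror only the stated 2:1 size and 4:1 worth ratios between versions).

```python
def variable_accuracy_schedule(requests, run_heuristic, effect, max_iterations):
    """Return the lowest-effect schedule found within max_iterations.

    run_heuristic(high, low) is a hypothetical callback returning
    (schedule, satisfied_high): a complete schedule plus the subset of
    `high` whose higher quality item was delivered.
    effect(schedule) plays the role of E[Sh]; smaller is better.
    """
    high = set(requests)  # requests still asking for the higher quality version
    low = set()           # requests switched to the lower quality version
    best = None
    for _ in range(max_iterations):
        schedule, satisfied_high = run_heuristic(high, low)
        if best is None or effect(schedule) < effect(best):
            best = schedule  # the best schedule is kept separately
        unsatisfied = high - set(satisfied_high)
        if not unsatisfied:
            break  # the next iteration would see identical inputs
        high -= unsatisfied
        low |= unsatisfied
    return best


# Toy stand-in for a heuristic: 5 units of capacity; a higher quality item
# costs 2 units and is worth 4, a lower quality item costs 1 and is worth 1.
def toy_heuristic(high, low):
    capacity, schedule = 5, []
    for request in sorted(high):
        if capacity >= 2:
            schedule.append((request, "high"))
            capacity -= 2
    for request in sorted(low):
        if capacity >= 1:
            schedule.append((request, "low"))
            capacity -= 1
    return schedule, {r for r, version in schedule if version == "high"}


def toy_effect(schedule):
    return -sum(4 if version == "high" else 1 for _, version in schedule)


best = variable_accuracy_schedule(["a", "b", "c"], toy_heuristic, toy_effect, 5)
```

In the toy run, request "c" misses its higher quality item in the first iteration and receives the lower quality version in the second, which lowers the effect. The early exit when every remaining higher quality request is satisfied is a convenience of this sketch and assumes a deterministic heuristic.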

9.2. Costs C1, C2, and C3

Figures 4 and 5 include the full_one procedure with costs
C1, C2, C3, and C4, where iteration 1 corresponds to the
situation  without  consideration  of  versions.  The  full  sets  of 
experiments  in  [6]  gave  insights  into  the  behaviors  of  these 
costs. 

The full_one_C1 heuristic performs well except for the
highest oversubscription rate test cases. Because cost C1
only  considers  the  benefit  of  moving  data  to  satisfy  a  single 
request,  this  suggests  that  in  very  highly  oversubscribed  net¬ 
works,  it  helps  to  consider  multiple  requesting  destinations 
that  would  collectively  benefit  from  a  data  transfer. 

The  full_one_C2  method  appears  to  suffer  from  its  choice 
of  destination  machines;  specifically,  it  allows  a  single  des¬ 
tination’s  urgency  (instead  of  a  collective  view  of  the  urgen¬ 
cies)  to  affect  which  valid  next  step  is  selected.  As  system 
oversubscription  rates  increase,  its  relative  performance  de¬ 
creases. 

The  full_one_C3  method  performs  consistently  poorly 
for  heavily  oversubscribed  networks.  Its  performance  in  the 
simulation  studies  of  [20]  indicated  that  it  would  not  like¬ 
ly  perform  well,  so  this  was  expected.  It  is  interesting  to 
note, however, that as $\omega$ was increased, the relative performance
of full_one_C3 increased as well. This suggests that
the problem with cost C3 is indeed due to allowing the urgency
factor to dominate the cost equation, because as the
priority  weight  is  increased,  it  begins  to  perform  well.  This 
is  especially  true  for  the  lower  oversubscription  rates. 

9.3.  Evaluation  of  Simulations 

The  data  sets  used  for  these  experiments  were  a  subset 
of  the  data  sets  created  for  the  simulation  study  of  Section 
8.  For  very  light  loading  (i.e.,  0.2),  all  of  the  heuristics  per¬ 
form  similarly  after  the  second  iteration.  Only  the  data  sets 
with  average  link  traversal  counts  of  2.5  were  used.  Five 
iterations of the variable accuracy algorithm were run. Results
from those runs under different loads are shown in Figures
2 and 3, where Figure 2 includes the upper and lower
bounds. It should be noted that each graph is normalized to
the performance of full_one_C4 at the end of its first iteration,
which is the same as the performance of full_one_C4
in the study of Section 8.

Figure 4. Weighted sum of satisfied requests’
priorities normalized to the performance of
full_one_C4 in iteration 1. The data set had an
oversubscription rate of 0.8, an average link
traversal count of 2.5, and an $\omega$ value of 4.

For  less  oversubscribed  networks,  the  heuristics  are  al¬ 
most  all  able  to  increase  their  own  respective  performance 
with  additional  iterations  (e.g.,  Figure  4).  For  more  over¬ 
subscribed  networks,  this  is  not  generally  the  case  (e.g.,  Fig¬ 
ure  5).  All  of  the  cost  criteria  used  here  except  Cl  consider 
more  than  one  destination  as  part  of  the  cost  of  sending  a 
data  item  to  its  next  machine.  The  implementation  of  the 
multiple  versions  approach  works  against  this,  particularly 
at  higher  oversubscription  rates.  For  example,  in  iteration 
one,  multiple  requests  may  contribute  to  the  overall  sum 
for a transfer to destination $d_1$. When using multiple versions,
the destinations that receive the second version will
no longer contribute to the sum for destination $d_1$. Because
of this, destination $d_1$’s request may no longer have a large
enough sum to obtain network resources. For this reason,
full_one_C1 (which does not collectively consider multiple
requesting destinations) is less inclined to decrease in performance
in successive iterations after the second iteration.

Figure 5. Weighted sum of satisfied requests’
priorities normalized to the performance of
full_one_C4 in iteration 1. The data set had an
oversubscription rate of 3.1, an average link
traversal count of 2.5, and an $\omega$ value of 4.

An  additional  reason  for  a  lack  of  improvement  after 
each  iteration  for  data  sets  with  high  oversubscription  rates 
is  related  to  the  large  number  of  requests  of  high  priority  in 
the  system.  There  are  already  very  many  data  items  in  these 
tests  with  a  desirable  priority  to  select  from,  and  the  sec¬ 
ondary  versions  of  data  items  are  not  any  better  of  a  choice 
than  any  of  the  primary  versions  of  data  items  that  are  avail¬ 
able. 

In  summary,  the  use  of  multiple  versions  will  help  some 
heuristics  improve  the  sum  of  priorities  satisfied  in  all  but 
the  most  oversubscribed  cases.  The  improvement  obtained 
in  some  operator  environments  exceeds  10%.  In  almost  all 
cases,  the  best  improvement  is  given  by  the  second  iteration 
of the variable time, variable accuracy algorithm.

10.  Summary  and  Conclusions 

Data  staging  is  an  important  data  management  issue  for 
distributed  computer  systems.  It  addresses  the  issues  of 
distributing  and  storing  over  numerous  geographically  dis¬ 
persed  locations  both  repository  data  and  continually  gener¬ 
ated  data  through  an  oversubscribed  network,  where  not  all 
data  requests  can  be  satisfied.  When  certain  data  with  their 
corresponding  priorities  need  to  arrive  at  a  site  with  limit¬ 
ed  storage  capacities  in  a  timely  fashion,  a  heuristic  must 
be  devised  to  schedule  the  necessary  communication  steps 
efficiently. 

The performance of nine heuristics was shown and
compared to an upper bound and a lower bound. Many
different weighting schemes for the relative importance of
different priority levels of requested data items were considered.
Each procedure and cost criterion was designed
with particular advantages in mind. The results presented
showed that, for the system parameters considered (e.g.,
priority weighting, oversubscription rate), the combination
of cost C4 or C1 with the full path/one destination procedure
and C4 with the partial path procedure consistently
performed the best, when using the measure of weighted
sum of priorities satisfied. Because each heuristic has
advantages, the procedure/cost criterion pair that performs
best may differ depending on the system parameters (i.e.,
the actual environment where the scheduler heuristic will be
deployed).

An  additional  novel  approach  using  a  variable  time,  vari¬ 
able  accuracy  method  that  considered  multiple  data  item 
versions  with  different  resource  requirements  was  evaluat¬ 
ed.  The  use  of  multiple  versions  was  shown  to  help  some 
heuristics  in  all  but  the  most  oversubscribed  cases;  in  many 
cases,  the  improvement  was  over  10%. 

In  summary,  a  class  of  heuristics  and  cost  criteria  that 
compare  well  to  upper  and  lower  bounds  were  developed 
and  analyzed.  While  in  general  several  heuristics  perform 
comparably, if a system is known to have a particular operating
environment (e.g., ω value, oversubscription rate),
there  may  be  a  preference  for  one  pair  over  another. 
Abstract

As  distributed  applications  have  become  more  widely 
used,  they  more  often  need  to  leverage  the  comput¬ 
ing  power  of  a  heterogeneous  network  of  computer 
architectures.  Modern  communications  libraries  pro¬ 
vide  mechanisms  that  hide  at  least  some  of  the  com¬ 
plexities  of  binary  data  interchange  among  heteroge¬ 
neous  machines.  However,  these  mechanisms  may  be 
cumbersome,  requiring  that  communicating  applica¬ 
tions  agree  a  priori  on  precise  message  contents,  or 
they  may  be  inefficient,  using  both  “up”  and  “down” 
translations  for  binary  data.  Finally,  the  seman¬ 
tics  of  many  packages,  particularly  those  which  re¬ 
quire  applications  to  manually  “pack”  and  “unpack” 
messages,  result  in  multiple  copies  of  message  data, 
thereby  reducing  communication  performance.  This 
paper  describes  PBIO,  a  novel  messaging  middleware 
which  offers  applications  significantly  more  flexibility 
in  message  exchange  while  providing  an  efficient  im¬ 
plementation  that  offers  high  performance. 

1  Introduction 

As  distributed  applications  have  become  more 
widely  used,  they  often  need  to  leverage  the  comput¬ 
ing  power  of  a  heterogeneous  network  of  computer 
architectures.  Modern  communications  libraries  pro¬ 
vide  mechanisms  that  hide  at  least  some  of  the  com¬ 
plexities  of  binary  data  interchange  among  heteroge¬ 
neous  machines.  The  features  and  semantics  of  these 
packages  are  typically  a  compromise  between  what 
might  be  useful  to  the  applications  and  what  can  be 
implemented  efficiently. 

For  example,  many  packages,  such  as  PVM[8]  and 
Nexus[7],  support  message  exchanges  in  which  the 
communicating  applications  “pack”  and  “unpack” 
messages,  building  and  decoding  them  field  by  field, 
datatype  by  datatype.  Other  packages,  such  as 
MPI[6],  allow  the  creation  of  user-defined  datatypes 
for  messages  and  message  fields  and  provide  some 


amount  of  marshalling  and  unmarshalling  support  for 
those  datatypes  internally. 

The  approach  of  requiring  the  application  to  build 
messages  manually  offers  applications  significant  flex¬ 
ibility  in  message  contents  while  ensuring  that  the 
pack  and  unpack  operations  are  performed  by  opti¬ 
mized,  compiled  code.  However,  relegating  message 
packing  and  unpacking  to  the  communicating  ap¬ 
plications  means  that  those  applications  must  have 
a  priori  agreement  on  the  contents  and  format  of 
messages.  This  is  not  an  onerous  requirement  in 
small-scale  stable  systems,  but  in  enterprise-scale  dis¬ 
tributed  computing,  the  need  to  simultaneously  up¬ 
date  all  application  components  in  order  to  change 
message  formats  can  be  a  significant  impediment  to 
the  integration,  deployment  and  evolution  of  complex 
systems. 

In  addition,  the  semantics  of  application-side  pack/ 
unpack  operations  generally  imply  a  data  copy  to  or 
from  message  buffers.  Such  copies  are  known [11,  13] 
to  have  a  significant  impact  on  communication  sys¬ 
tem  performance.  Packages  which  can  perform  inter¬ 
nal  marshalling,  such  as  MPI,  have  an  opportunity  to 
avoid  data  copies  and  to  offer  more  flexible  semantics 
in  matching  fields  provided  by  senders  and  receivers. 
However,  existing  packages  have  failed  to  capitalize 
on those opportunities. For example, MPI’s type-matching
rules require strict a priori agreement on
the contents of messages. Additionally, most MPI implementations
implement marshalling of user-defined
datatypes via mechanisms that amount to interpreted
versions of field-by-field packing.

This  paper  describes  PBIO  (Portable  Binary  Input/ 
Output)  [3],  a  multi-purpose  communication  middle¬ 
ware.  In  developing  PBIO  we  have  not  attempted  to 
recreate  various  higher-level  communication  abstrac¬ 
tions  offered  by  MPI  or  by  the  Remote  Service  Re¬ 
quests  of  Nexus.  Instead,  we  provide  flexible  hetero¬ 
geneous  binary  data  transport  for  simple  messaging 
of  a  wide  range  of  application  data  structures,  using 
novel  approaches  such  as  dynamic  code  generation 
(DCG)  to  preserve  efficiency.  In  addition,  PBIO’s 
flexibility  in  matching  transmitted  and  expected  data 
types provides key support for application evolution
that  is  missing  from  other  communication  systems. 

This  paper  briefly  describes  PBIO  semantics  and 
features,  and  then  illustrates  performance  metrics 
across  a  heterogeneous  environment  of  Sun  Sparc  and 
x86-based machines running Solaris. These metrics
are  compared  against  the  data  communication  mea¬ 
surements  obtained  by  using  MPI  as  a  data  commu¬ 
nication  mechanism  across  the  same  network  archi¬ 
tecture.  The  paper  will  show  that  the  features  and 
flexibility  of  PBIO  do  not  impose  overhead  beyond 
that  imposed  by  other  communications  systems.  In 
the  worst  case  PBIO  performs  as  well  as  other  sys¬ 
tems,  and  in  many  cases  PBIO  offers  a  significant 
performance  improvement  over  comparable  commu¬ 
nications  packages. 

Much  of  PBIO’s  performance  advantage  is  due  to 
its  use  of  dynamic  code  generation  to  optimize  trans¬ 
lations  from  wire  to  native  format.  Because  this  is 
a  novel  feature  in  communications  middleware,  its 
impact  on  PBIO’s  performance  is  also  considered  in¬ 
dependently.  In  this  manner,  we  show  that  for  pur¬ 
poses  of  data  compatibility,  PBIO,  along  with  code 
generation,  can  provide  reliable,  high  performance, 
easy-to-use,  easy-to-migrate,  heterogeneous  support 
for  distributed  applications. 

2  The  PBIO  Communication  Library 

In  order  to  conserve  I/O  bandwidth  and  reduce 
storage  and  processing  requirements,  storing  and 
transmitting  data  in  binary  form  is  often  desirable. 
However,  transmission  of  binary  data  between  hetero¬ 
geneous  environments  has  been  problematic.  PBIO 
was  developed  as  a  portable  self-describing  binary 
data  library,  providing  both  stream  and  file  support 
along  with  data  portability. 

The  basic  approach  of  the  Portable  Binary  I/O  li¬ 
brary  is  straightforward.  PBIO  is  a  record-oriented 
communications  medium.  Writers  of  data  must  pro¬ 
vide  descriptions  of  the  names,  types,  sizes  and  po¬ 
sitions  of  the  fields  in  the  records  they  are  writ¬ 
ing.  Readers  must  provide  similar  information  for  the 
records  they  wish  to  read.  No  translation  is  done  on 
the  writer’s  end,  our  motivation  being  to  offload  pro¬ 
cessing  from  data  providers  (e.g.,  servers)  whenever 
possible.  On  the  reader’s  end,  the  format  of  the  in¬ 
coming  record  is  compared  with  the  format  expected 
by  the  program.  Correspondence  between  fields  in 
incoming  and  expected  records  is  established  by  field 
name,  with  no  weight  placed  on  size  or  ordering  in 
the  record.  If  there  are  discrepancies  in  field  size  or 
placement,  then  PBIO’s  conversion  routines  perform 


the  appropriate  translations.  Thus,  the  reader  pro¬ 
gram  may  read  the  binary  information  produced  by 
the  writer  program  despite  potential  differences  in: 
(1) byte ordering on the reading and writing architectures;
(2) sizes of data types (e.g., long and int);
and (3) structure layout by compilers.
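
The name-based matching described above can be illustrated with a minimal sketch. The field_desc type and both functions are hypothetical names introduced here for illustration; they are not PBIO's actual interface. The sketch also ignores byte order, which in practice would force a conversion even when sizes and offsets agree.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical field descriptor: name, size in bytes, offset in the record. */
typedef struct {
    const char *name;
    size_t      size;
    size_t      offset;
} field_desc;

/* Find a field by name in a format; returns NULL if it is absent. */
static const field_desc *find_field(const field_desc *fmt, size_t n,
                                    const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(fmt[i].name, name) == 0)
            return &fmt[i];
    return NULL;
}

/* For each expected field, look up the incoming field of the same name.
 * Fields with identical size and offset can be read directly, fields that
 * differ need a conversion step, and fields absent from the incoming
 * format are counted as missing.  Field order plays no role. */
static void plan_conversion(const field_desc *wire, size_t nwire,
                            const field_desc *expect, size_t nexpect,
                            int *direct, int *convert, int *missing)
{
    *direct = *convert = *missing = 0;
    for (size_t i = 0; i < nexpect; i++) {
        const field_desc *w = find_field(wire, nwire, expect[i].name);
        if (w == NULL)
            (*missing)++;
        else if (w->size == expect[i].size && w->offset == expect[i].offset)
            (*direct)++;
        else
            (*convert)++;
    }
}
```

Because the lookup is by name, a writer may add or reorder fields without breaking a reader that only asks for the fields it knows.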

Since  full  format  information  for  the  incoming 
record  is  available  prior  to  reading  it,  the  receiv¬ 
ing  application  can  make  run-time  decisions  about 
the use and processing of incoming messages about
which it had no a priori knowledge. However, this
additional  flexibility  comes  with  the  price  of  poten¬ 
tially  complex  format  conversions  on  the  receiving 
end.  Since  the  format  of  incoming  records  is  prin¬ 
cipally  defined  by  the  native  formats  of  the  writers 
and  PBIO  has  no  a  priori  knowledge  of  the  native 
formats  used  by  the  program  components  with  which 
it  might  communicate,  the  precise  nature  of  this  for¬ 
mat  conversion  must  be  determined  at  run-time. 

Since  high  performance  applications  can  ill  afford 
the  increased  communication  costs  associated  with 
interpreted  format  conversion,  PBIO  uses  dynamic 
code  generation  to  reduce  these  costs.  The  cus¬ 
tomized  data  conversion  routines  generated  must  be 
able  to  access  and  store  data  elements,  convert  el¬ 
ements  between  basic  types  and  call  subroutines  to 
convert  complex  subtypes.  Measurements^]  show 
that  the  one-time  costs  of  DCG,  and  the  perfor¬ 
mance  gains  by  then  being  able  to  leverage  com¬ 
piled  (and  compiler-optimized)  code,  far  outweigh  the 
costs  of  continually  interpreting  data  formats.  The 
analysis  in  the  following  section  shows  that  DCG, 
together  with  native-format  data  transmission  and 
copy  reduction,  allows  PBIO  to  provide  its  additional 
type-matching  flexibility  without  negatively  impact¬ 
ing  performance.  In  fact,  PBIO  outperforms  our 
benchmark  communications  package  in  all  measured 
situations. 

3  Evaluation 

In  order  to  thoroughly  evaluate  PBIO’s  perfor¬ 
mance  and  its  utility  in  high-performance  commu¬ 
nication,  we  present  a  variety  of  measurements  in 
different  circumstances.  Where  possible,  we  com¬ 
pare  PBIO’s  performance  to  the  cost  of  similar  op¬ 
erations  in  MPI.  Additionally,  we  include  measure¬ 
ments  which  evaluate  PBIO’s  performance  in  situ¬ 
ations  which  are  not  supported  by  other  communi¬ 
cations  packages.  In  particular,  we  evaluate  PBIO’s 
support  for  application  evolution  and  its  ability  to 
transmit  dynamically  sized  data  elements. 
Figure 1: Cost breakdown for message exchange.


3.1  Analysis  of  costs  in  heterogeneous 
data  exchange 

Before  analyzing  PBIO  costs  in  detail,  it  is  useful  to 
examine  the  costs  in  an  exchange  of  binary  data  in 
a  heterogeneous  environment.  As  a  baseline  for  this 
discussion,  we  use  the  MPICH[10]  implementation  of 
MPI,  a  popular  messaging  package  in  cluster  comput¬ 
ing  environments.  Figure  1  represents  a  breakdown 
of  the  costs  in  an  MPI  message  round-trip  between 
an x86-based PC and a Sun Sparc connected by 100
Mbps Ethernet.1 The time components labeled “Encode”
represent the time span between the application
invoking MPI_Send() and the eventual call to
write  data  on  a  socket.  The  “Decode”  component 
is  the  time  span  between  the  recv()  call  returning 
and  the  point  at  which  the  data  is  in  a  form  us¬ 
able  by  the  application.  In  generating  these  num¬ 
bers  network  transmission  times  were  measured  with 
NetPerf[9]  and  send  and  receive  times  were  measured 
by  substituting  dummy  calls  for  socket  send()  and 
recv().  This  delineation  allows  us  to  focus  on  the  en¬ 
code/decode  costs  involved  in  binary  data  exchange. 
That  these  costs  are  significant  is  clear  from  the  fig¬ 
ure,  where  they  typically  represent  66%  of  the  total 
cost  of  the  exchange. 

1 The Sun machine is an Ultra 30 with a 247 MHz CPU running
Solaris 7. The x86 machine is a 450 MHz Pentium II, also
running Solaris 7.

Figure 1 shows the cost breakdown for messages of
a selection of sizes, but in practice, message times depend
upon many variables. Some of these variables,
such  as  basic  operating  system  characteristics  that 
affect  raw  end-to-end  TCP/IP  performance,  are  be¬ 
yond  the  control  of  the  application  or  the  communica¬ 
tion  middleware.  Different  encoding  strategies  in  use 
by  the  communication  middleware  may  change  the 
number  of  raw  bytes  transmitted  over  the  network, 
but  those  differences  tend  to  be  negligible.  There¬ 
fore,  the  remainder  of  our  analysis  will  concentrate 
on  the  more  controllable  sending  side  and  receiving 
side  costs. 

Another  application  characteristic  which  has  a 
strong  effect  upon  end-to-end  message  exchange  time 
is  the  precise  nature  of  the  data  to  be  sent  in  the 
message.  It  could  be  a  contiguous  block  of  atomic 
data  elements  (such  as  an  array  of  floats),  a  stride- 
based  element  (such  as  a  stripe  of  a  homogeneous  ar¬ 
ray),  a  structure  containing  a  mix  of  data  elements, 
or even a complex pointer-based structure. MPI, designed
for scientific computing, has strong facilities
for homogeneous arrays and strided elements. MPI’s
support for structures is less efficient than its support
for contiguous arrays of atomic data elements,
and it doesn’t attempt to support pointer-based
structures at all. PBIO doesn’t attempt to support
strided  array  access,  but  otherwise  supports  all  types 
with  equal  efficiency,  including  a  non-recursive  sub¬ 
set  of  pointer-based  structures.  The  message  type  of 
the  100Kb  message  in  Figure  1  is  a  non-homogeneous 
structure  taken  from  the  messaging  requirements  of 
a real application, a mechanical engineering simulation
of the effects of micro-structural properties on
solid-body behavior. The smaller message types are
representative subsets of that mixed-type message.

Figure 2: Strings and dynamic arrays in PBIO.

The  next  sections  will  examine  PBIO’s  costs  in  ex¬ 
changing  the  same  sets  of  messages.  Subsequently, 
Section  3.5  will  examine  costs  for  other  data  types. 

3.2  Sending  side  cost 

As  is  mentioned  in  Section  2,  PBIO  transmits  data 
in  the  native  format  of  the  sender.  No  copies  or  data 
conversions are necessary to prepare simple structure
data for transmission. So, while MPICH’s costs
to prepare for transmission on the Sparc vary from
34 μsec for the 100 byte record up to 13 msec for the
100Kb record, PBIO’s cost is a flat 3 μsec. Of course,
this  efficiency  is  accomplished  by  moving  most  of  the 
complexity  to  the  receiver,  where  Section  3.3  tells  a 
more  complex  story. 

As mentioned above, PBIO also supports the transmission
of some pointer-based structures. In particular,
PBIO allows an element of a structure to be a
null-terminated string, or a pointer to a dynamically
sized array,2 as shown in Figure 2. The array elements
may be of an atomic data type or a previously
registered  structure.  That  there  is  no  forward  dec¬ 
laration  mechanism  or  self-referentiality  for  structure 
types  restricts  PBIO  from  describing  such  things  as 
linked  lists.  However,  relatively  complex  structures, 
such  as  the  one  depicted  in  Figure  3  can  be  directly 
transmitted.  The  ability  to  directly  transmit  dynam¬ 
ically  sized  arrays  is  a  feature  that  is  not  normally 
present  in  communications  middleware. 

2 In the case of a dynamically sized array, the array size
must be given by another, integer-typed, element in the base
structure.

Figure 3: A multi-level pointer structure that
can be transmitted by PBIO.

Unlike contiguous structures, pointer-based entities
do require some preparation before they are sent.
In particular, PBIO must walk the structure to 1)
prepare a transmission list of data blocks and their
lengths, and 2) change all internal pointers from addresses
to offsets within the message. The type semantics
ensure that there can be no circularities
in the structure, so the ‘walk’ is a simple tree descent
which stops when it reaches the ‘leaf’ structures
which contain no pointers. In order to avoid changing
the data directly, structures containing pointers
are copied to temporary memory and the pointers
modified there. This imposes a cost on the sender
that is proportional to the amount of data that must
be copied and the number of pointers that must be
adjusted. Because no similar features are included
in common communications libraries, we don’t include
any representative measurements of these costs.
However, we do observe that in the most common
use of dynamic arrays, where a relatively small base
structure holds pointers and sizes for one or more arrays,
the ‘walk’ is a simple pass over the base structure,
the majority of the data is in the ‘leaves’ which
are not copied, and the additional sender-side processing
is not overly significant.
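
The sender-side preparation described above can be sketched for the common case of one base structure pointing to one dynamic array. This is a minimal illustration under stated assumptions, not PBIO's code: sample_record, xmit_block and prepare_record are hypothetical names, and the array data is assumed to follow the base structure directly in the message.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: a small base structure holding a size and a
 * pointer to a dynamically sized array (the "leaf" data). */
typedef struct {
    int     count;    /* number of elements in values */
    double *values;   /* dynamically sized array */
} sample_record;

/* One entry of the transmission list: a block of memory and its length. */
typedef struct {
    const void *base;
    size_t      len;
} xmit_block;

/* Prepare a record for transmission.  The base structure is copied into
 * tmp so the caller's data is left untouched.  In the copy, the pointer is
 * replaced by the offset (from the start of the message) at which the
 * array data will appear.  The array itself is listed, not copied.
 * Returns the number of blocks written to list (1 or 2). */
static size_t prepare_record(const sample_record *rec, sample_record *tmp,
                             xmit_block list[2])
{
    *tmp = *rec;
    tmp->values = (double *)(uintptr_t)sizeof *tmp;  /* address -> offset */

    list[0].base = tmp;
    list[0].len  = sizeof *tmp;
    if (rec->count <= 0 || rec->values == NULL)
        return 1;

    list[1].base = rec->values;                      /* leaf: not copied */
    list[1].len  = (size_t)rec->count * sizeof(double);
    return 2;
}
```

A list of this shape could then be handed to a gather-style write such as POSIX writev(), so the leaf data goes out without an intermediate copy.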

3.3  Receiving  side  cost 

PBIO’s  approach  to  binary  data  exchange  eliminates 
sender-side  processing  by  transmitting  in  the  sender’s 
native  format  and  isolating  the  complexity  of  man¬ 
aging  heterogeneity  in  the  receiver.  Essentially,  the 
receiver  must  perform  a  conversion  from  the  vari¬ 
ous  incoming  'wire’  formats  to  the  receiver’s  'native’ 
format.  PBIO  matches  fields  by  name,  so  a  conver¬ 
sion  may  require  byte-order  changes  (byte-swapping), 
movement  of  data  from  one  offset  to  another,  or  even 
a  change  in  the  basic  size  of  the  data  type  (for  exam¬ 
ple,  from  a  4-byte  integer  to  an  8-byte  integer). 
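
As a rough illustration of the three kinds of change just listed, the sketch below converts one field: a 4-byte little-endian integer at one offset becomes an 8-byte native integer at another. The function names and the choice of a little-endian sender are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a 4-byte two's-complement integer stored in little-endian order,
 * independent of the byte order of the machine running this code. */
static int32_t load_le32(const unsigned char *p)
{
    uint32_t u = (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
                 ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    return (int32_t)u;
}

/* Convert one field: byte order is fixed by load_le32, the value is
 * widened (with sign extension) from 4 to 8 bytes, and the result is
 * stored at a different offset in the native record. */
static void convert_i32_to_i64(const unsigned char *wire, size_t wire_off,
                               unsigned char *native, size_t native_off)
{
    int64_t v = load_le32(wire + wire_off);
    memcpy(native + native_off, &v, sizeof v);
}
```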

This conversion is another form of the “marshaling
problem” that occurs widely in RPC implementations[1]
and in network communication. That marshaling
can be a significant overhead is also well
known[2, 14], and tools such as USC[12] attempt
to optimize marshaling with compile-time solutions.
Unfortunately, the dynamic form of the marshaling
problem  in  PBIO,  where  the  layout  and  even  the 
complete  field  contents  of  the  incoming  record  are  un¬ 
known  until  run-time,  rules  out  such  static  solutions. 
The  conversion  overhead  is  nil  for  some  homogeneous 
data  exchanges,  but  as  Figure  1  shows,  the  overhead 
is  high  (66%)  for  some  heterogeneous  exchanges. 

Generically,  receiver-side  overhead  in  communica¬ 
tion  middleware  has  several  components  which  can  be 
traded  off  against  each  other  to  some  extent.  Those 
basic  costs  are: 

•  byte-order  conversion, 

•  data  movement  costs,  and 

•  control  costs. 

Byte  order  conversion  costs  are  to  some  extent  un¬ 
avoidable.  If  the  communicating  machines  use  differ¬ 
ent  byte  orders,  the  translation  must  be  performed 
somewhere  regardless  of  the  capabilities  of  the  com¬ 
munications  package. 

Data  movement  costs  are  harder  to  quantify.  If 
byteswapping  is  necessary,  data  movement  can  be 
performed  as  part  of  the  process  without  incurring 
significant  additional  costs.  Otherwise,  clever  design 
of  the  communications  middleware  can  often  avoid 
copying  data.  However,  packages  that  define  a  ‘wire’ 
format  for  transmitted  data  have  a  harder  time  be¬ 
ing  clever  in  this  area.  One  of  the  basic  difficulties  is 
that  the  native  format  for  mixed-datatype  structures 
on  most  architectures  has  gaps,  unused  areas  between 
fields,  inserted  by  the  compiler  to  satisfy  data  align¬ 
ment  requirements.  To  avoid  making  assumptions 
about  the  alignment  requirements  of  the  machines 
they  run  on,  most  packages  use  wire  formats  which 
are  fully  packed  and  have  no  gaps.  This  mismatch 
forces  a  data  copy  operation  in  situations  where  a 
clever  communications  system  might  otherwise  have 
avoided  it. 
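
The gap between native and packed layouts can be observed directly in C. The structure below is an arbitrary example chosen for illustration; the exact amount of padding depends on the compiler and target ABI, so the sketch computes it rather than assuming a value.

```c
#include <stddef.h>

/* A mixed-datatype structure.  On ABIs that align double to 8 bytes, the
 * compiler inserts unused bytes after c (and usually after i as well). */
struct mixed {
    char   c;
    double d;
    int    i;
};

/* Size of the same fields in a fully packed wire format with no gaps. */
static size_t packed_size(void)
{
    return sizeof(char) + sizeof(double) + sizeof(int);
}

/* Unused bytes between the end of c and the start of d. */
static size_t gap_before_d(void)
{
    return offsetof(struct mixed, d) - sizeof(char);
}

/* Total padding the compiler added to the native layout. */
static size_t padding_bytes(void)
{
    return sizeof(struct mixed) - packed_size();
}
```

Whenever padding_bytes() is nonzero, a packed wire image cannot simply be used in place as the native structure, which is the forced copy the text refers to.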

Control  costs  represent  the  overhead  of  iterating 
through  the  fields  in  the  record  and  deciding  what  to 
do  next.  Packages  which  require  the  application  to 
marshal  and  unmarshal  their  own  data  have  the  ad¬ 
vantage  that  this  process  occurs  in  special-purpose 
compiler-optimized  code,  minimizing  control  costs. 
However,  to  keep  that  code  simple  and  portable,  such 
systems  uniformly  rely  on  communicating  in  a  pre¬ 
defined  wire  format,  incurring  the  data  movement 
costs  described  in  the  previous  paragraph. 

Packages  that  marshal  data  themselves  typically 
use  an  alternative  approach  to  control,  where  the 
marshalling  process  is  controlled  by  what  amounts 
to  a  table-driven  interpreter.  This  interpreter  mar¬ 
shals  or  unmarshals  application-defined  data  making 


Message  size 

Figure  4:  Receiver  side  costs  for  PBIO  and 
MPI  interpreted  conversions. 


data  movement  and  conversion  decisions  based  upon 
a  description  of  the  structure  provided  by  the  applica¬ 
tion  and  its  knowledge  of  the  format  of  the  incoming 
record.  This  approach  to  data  conversion  gives  the 
package  significant  flexibility  in  reacting  to  changes 
in  the  incoming  data  and  was  our  initial  choice  for 
PBIO.  Figure  4  shows  a  comparison  of  receiver-side 
processing costs on the Sparc for interpreted converters
used by MPICH (via the MPI_Unpack() call) and
PBIO. PBIO’s converter is relatively heavily optimized
and performs considerably better than MPI,
in  part  because  MPICH  uses  a  separate  buffer  for 
the  unpacked  message  rather  than  reusing  the  receive 
buffer  (as  PBIO  does).  However,  PBIO’s  receiver- 
side  conversion  costs  still  contribute  roughly  20%  of 
the  cost  of  an  end-to-end  message  exchange.  While  a 
portion  of  this  conversion  overhead  must  be  the  con¬ 
sequence  of  the  raw  number  of  operations  involved 
in  performing  the  data  conversion,  we  believed  that 
a  significant  fraction  of  this  overhead  was  due  to  the 
fact  that  the  conversion  is  essentially  being  performed 
by  an  interpreter. 
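
A table-driven interpreter of the kind discussed here might look like the following sketch. The conv_entry layout and the two operations are hypothetical simplifications, not the converter used by PBIO or MPICH; the point is only to show where the per-field control cost arises.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical conversion-table entry, one per field. */
typedef enum { OP_COPY, OP_SWAP } conv_op;

typedef struct {
    conv_op op;       /* copy bytes unchanged, or reverse their order */
    size_t  size;     /* field size in bytes */
    size_t  src_off;  /* offset in the incoming (wire) record */
    size_t  dst_off;  /* offset in the native record */
} conv_entry;

/* Interpreted conversion: walk the table and decide, field by field, what
 * to do.  The loop and switch executed for every field of every record
 * are the control cost that generated code would avoid. */
static void interpret(const conv_entry *table, size_t n,
                      const unsigned char *src, unsigned char *dst)
{
    for (size_t i = 0; i < n; i++) {
        const conv_entry *e = &table[i];
        switch (e->op) {
        case OP_COPY:
            memcpy(dst + e->dst_off, src + e->src_off, e->size);
            break;
        case OP_SWAP:
            for (size_t b = 0; b < e->size; b++)
                dst[e->dst_off + b] = src[e->src_off + e->size - 1 - b];
            break;
        }
    }
}
```

The table is built once per incoming format, but it is re-read and re-dispatched on every record.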

Our  decision  to  transmit  data  in  the  sender’s  native 
format  results  in  the  wire  format  being  unknown  to 
the  receiver  until  run-time,  making  a  remedy  to  the 
problem  of  interpretation  overhead  difficult.  How¬ 
ever,  our  solution  to  the  problem  was  to  employ  dy¬ 
namic  code  generation  to  create  a  customized  con¬ 
version  subroutine  for  every  incoming  record  type. 
These  routines  are  generated  by  the  receiver  on  the 
fly,  as  soon  as  the  wire  format  is  known,  through  a 
procedure  that  structurally  resembles  the  interpreted 
conversion  itself.  However,  instead  of  performing  the 
conversion  this  procedure  directly  generates  machine 
code  for  performing  the  conversion. 
Figure 5: Receiver side costs for interpreted
conversions in MPI and PBIO and DCG conversions
in PBIO.


The  execution  times  for  these  dynamically  gener¬ 
ated  conversion  routines  are  shown  in  Figure  5.  The 
dynamically  generated  conversion  routine  operates 
significantly  faster  than  the  interpreted  version.  This 
improvement  removes  conversion  as  a  major  cost  in 
communication,  bringing  it  down  to  near  the  level  of 
a  copy  operation,  and  is  the  key  to  PBIO’s  ability  to 
efficiently  perform  many  of  its  functions. 

The  cost  savings  achieved  by  PBIO  through  the 
techniques  described  in  this  section  are  directly  re¬ 
flected  in  the  time  required  for  an  end-to-end  mes¬ 
sage  exchange.  Figure  6  shows  a  comparison  of  PBIO 
and  MPICH  message  exchange  times  for  mixed-field 
structures  of  various  sizes.  The  performance  differ¬ 
ences  are  substantial,  particularly  for  large  message 
sizes  where  PBIO  can  accomplish  a  round-trip  in  45% 
of  the  time  required  by  MPICH.  The  performance 
gains  are  due  to: 

•  virtually  eliminating  the  sender-side  encoding 
cost  by  transmitting  in  the  sender's  native  for¬ 
mat,  and 

•  using  dynamic  code  generation  to  customize  a 
conversion  routine  on  the  receiving  side  (cur¬ 
rently  not  done  on  the  x86  side). 

3.4  Details  of  dynamic  code  generation 

The  dynamic  code  generation  in  PBIO  is  performed 
by  Vcode,  a  fast  dynamic  code  generation  package 
developed  at  MIT  by  Dawson  Engler[5].  We  have 
significantly  enhanced  Vcode  and  ported  it  to  several 
new architectures. With the present implementation we
can generate code for Sparc (v8, v9 and v9 64-bit),
MIPS  (old  32-bit,  new  32-bit  and  64-bit  ABIs)  and 
DEC  Alpha  architectures.  An  x86  port  of  Vcode  is  in 
progress,  but  not  yet  sufficiently  advanced  for  us  to 
generate  PBIO’s  conversion  routines.  Vcode  essen¬ 
tially  provides  an  API  for  a  virtual  RISC  instruction 
set.  The  provided  instruction  set  is  relatively  generic, 
so  that  most  Vcode  instruction  macros  generate  only 
one  or  two  native  machine  instructions.  Native  ma¬ 
chine  instructions  are  generated  directly  into  a  mem¬ 
ory  buffer  and  can  be  executed  without  reference  to 
an  external  compiler  or  linker. 

Employing  DCG  for  conversions  means  that  PBIO 
must  bear  the  cost  of  generating  the  code  as  well 
as  executing  it.  Because  the  format  information  in 
PBIO  is  transmitted  only  once  on  each  connection 
and  data  tends  to  be  transmitted  many  times,  con¬ 
version  generation  is  not  normally  a  significant  over¬ 
head.  Yet  that  overhead  must  still  be  considered  to 
determine  whether  or  not  the  use  of  DCG  results  in 
performance  gains. 

The  proportional  overhead  encountered  in  actually 
generating  conversion  code  varies  dramatically  de¬ 
pending  upon  the  internal  structure  of  the  record. 
This  differs  from  the  situation  in  Figure  5,  where 
the  worst-case  conversion  run-time  is  more  dependent 
upon  the  size  of  the  message  than  its  structure.  To 
understand  this  variation,  consider  the  conversion  of 
a  record  that  contains  large  internal  arrays.  In  this 
case,  the  conversion  code  consists  of  a  few  for  loops 
that  process  large  amounts  of  data.  In  comparison, 
a  record  of  similar  size  consisting  solely  of  indepen¬ 
dent  fields  of  atomic  data  types  requires  custom  code 
for  each  field.  The  result  is  that  for  records  consisting 
solely  of  arrays,  DCG  almost  always  improves  perfor¬ 
mance.  For  array-based  records  of  around  200  bytes 
the  time  to  generate  and  execute  dynamic  conversion 
code  is  less  than  the  time  to  perform  an  interpreted 
conversion.  At  that  point,  DCG  is  a  performance  im¬ 
provement,  even  if  the  conversion  routine  is  only  used 
once. 

The  situation  is  less  clear  for  record  formats  con¬ 
sisting  mostly  of  individual  atomic  fields.  For  this 
type  of  record,  dynamically  generated  conversions 
run  nearly  an  order  of  magnitude  faster  than  inter¬ 
preted  conversions,  but  the  one-time  cost  of  doing  the 
code  generation  is  relatively  high.  Obviously,  if  many 
records  are  exchanged,  the  costs  will  be  amortized 
over  the  improved  conversion  times.  But  for  one¬ 
time  exchanges  dynamic  code  generation  for  conver¬ 
sions  may  be  more  expensive  than  simple  interpreted 
conversions. 
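
The amortization argument can be written as a simple break-even condition. The symbols are introduced here for illustration and do not appear in the measurements: G is the one-time cost of generating the conversion routine, t_dcg and t_int are the per-record costs of the generated and interpreted conversions, and n is the number of records exchanged in a given format.

```latex
\[
G + n\,t_{\mathrm{dcg}} < n\,t_{\mathrm{int}}
\quad\Longleftrightarrow\quad
n > \frac{G}{t_{\mathrm{int}} - t_{\mathrm{dcg}}},
\qquad t_{\mathrm{int}} > t_{\mathrm{dcg}} .
\]
% If the generated routine runs an order of magnitude faster,
% t_dcg = t_int / 10, the condition becomes
\[
n > \frac{10}{9}\,\frac{G}{t_{\mathrm{int}}} .
\]
```

Under this reading, DCG pays off for a one-time exchange (n = 1) only when G is smaller than the time saved on that single conversion, which is the situation reported for array-based records and not for records made up of many individual atomic fields.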
start of procedure bookkeeping
    save   %sp, -360, %sp
byteswap load and store the ‘ivalue’ field
    clr    %g1
    ldswa  [ %i0 + %g1 ] #ASI_P_L, %g2
    st     %g2, [ %i1 ]
byteswap load and store the ‘dvalue’ field
    mov    4, %g1
    ldswa  [ %i0 + %g1 ] #ASI_P_L, %g2
    mov    8, %g1
    ldswa  [ %i0 + %g1 ] #ASI_P_L, %g3
    st     %g3, [ %sp + 0x158 ]
    st     %g2, [ %sp + 0x15c ]
    ldd    [ %sp + 0x158 ], %f4
    std    %f4, [ %i1 + 8 ]
loop to handle ‘iarray’
save ‘incoming’ and ‘destination’ pointers for later restoration
    st     %i0, [ %sp + 0x160 ]
    st     %i1, [ %sp + 0x164 ]
make regs i0 and i1 point to start of incoming and destination float arrays
    add    %i0, 0xc, %i0
    add    %i1, 0x10, %i1
setup loop counter
    mov    5, %g3
loop body
    clr    %g1
    ldswa  [ %i0 + %g1 ] #ASI_P_L, %g2
    st     %g2, [ %i1 ]
end of loop, increment ‘incoming’ and ‘destination’, decrement loop count, test for end and branch
    dec    %g3
    add    %i0, 4, %i0
    add    %i1, 4, %i1
    cmp    %g3, 0
    bg,a   0x185c70
    clr    %g1
reload original ‘incoming’ and ‘destination’ pointers
    ld     [ %sp + 0x160 ], %i0
    ld     [ %sp + 0x164 ], %i1
end-of-procedure bookkeeping
    ret
    restore

Figure 7: A sample DCG conversion routine.


For the reader desiring more information on the precise nature of the
code that is generated, we include a small sample subroutine in
Figure 7. This particular conversion subroutine converts message data
received from an x86 machine into native Sparc data. The message being
exchanged has a relatively simple structure:

typedef struct small_record {
    int ivalue;
    double dvalue;
    int iarray[8];
};

Since  the  record  is  being  sent  from  an  x86  and  PBIO 
always  sends  data  in  the  sender's  native  data  formats 
and  layout,  the  “wire”  and  native  formats  differ  in 
both  byte  order  and  alignment.  In  particular,  the 
floating  point  value  is  aligned  on  a  4-byte  boundary 
in  the  x86  format  and  on  an  8-byte  boundary  on  the 
Sparc. The subroutine takes two arguments. The first argument, in
register %i0, is a pointer to the incoming “wire format” record. The
second argument, in register %i1, is a pointer to the desired
destination, where the converted record is to be written in native
Sparc format.
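
The layout difference can be illustrated with Python's standard struct module. This is only a sketch of the two layouts described above, not PBIO code. The format strings assume a 4-byte int, an 8-byte double, no padding before the double on the x86, and 4 bytes of padding on the Sparc, which agrees with the offsets used in Figure 7 (the double is read from wire offsets 4 and 8 and stored at native offset 8; the array starts at wire offset 0xc and native offset 0x10). The names WIRE_X86, NATIVE_SPARC and convert are hypothetical.

```python
import struct

# Little-endian wire layout sent by the x86: int, double at offset 4, int[8].
WIRE_X86 = struct.Struct("<id8i")
# Big-endian native Sparc layout: int, 4 pad bytes, double at offset 8, int[8].
NATIVE_SPARC = struct.Struct(">i4xd8i")

def convert(wire_bytes):
    """Unpack an x86 wire record and repack it in the Sparc layout.

    Unpacking with '<' and repacking with '>' performs the byte swap;
    the '4x' pad in the native format performs the realignment.
    """
    fields = WIRE_X86.unpack(wire_bytes)
    return NATIVE_SPARC.pack(*fields)

record = WIRE_X86.pack(7, 2.5, *range(8))
native = convert(record)
print(WIRE_X86.size, NATIVE_SPARC.size)  # prints "44 48"
```

An interpreted converter in this style walks the field list on every record, which is the per-record cost that a generated routine such as the one in Figure 7 avoids.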

The exact details of the code are interesting for a couple of reasons.
First, we make use of the Sparc V9 Load from Alternate Space
instructions, which can perform byteswapping in hardware during the
fetch from memory. This yields a significant savings over byteswapping
with register shifts and masks. Since compilers do not normally
generate these instructions, being able to use them directly here is
one of the advantages of dynamic code generation.
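
For comparison, the software alternative mentioned above can be sketched as follows. This is an illustrative 32-bit byte swap using shifts and masks, written in Python for readability; the function name bswap32 is an assumption and does not come from PBIO.

```python
def bswap32(x):
    """Reverse the byte order of a 32-bit unsigned value using only
    shifts and masks (four mask/shift pairs combined with ORs)."""
    return (((x & 0x000000FF) << 24) |
            ((x & 0x0000FF00) << 8) |
            ((x & 0x00FF0000) >> 8) |
            ((x & 0xFF000000) >> 24))

print(hex(bswap32(0x11223344)))  # prints "0x44332211"
```

Each swapped word costs several register operations in this form, whereas the byte-swapping load folds the swap into the memory fetch itself.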

Second, from an optimization point of view, the generated code is
actually quite poor. Among other things, it performs two instructions
when one would obviously suffice, and unnecessarily generates an extra
load/store pair to get the double value into a float register. There
are several reasons for this suboptimal code generation, including the
generic nature of the virtual RISC instruction set offered by Vcode,
the lack of an optimizer to repair it, and the fact that we have not
seriously attempted to make the code generation better. Even when
generating poor code, DCG conversions are a significant improvement
over other approaches.

Examining the generated code may also bring to mind another lurking
subtlety in generating conversion routines: data alignment. The
alignment of fields in the incoming record reflects the restrictions of
the sender. If the receiver has more stringent restrictions, the
generated load instruction may end up referencing a misaligned address,
a fatal error on many architectures. This situation would actually have
occurred in the example shown in Figure 7, where the incoming double
value is aligned on a 4-byte boundary, because the Sparc requires
8-byte alignment for 8-byte loads. Fortunately, the suboptimal Sparc
dynamic code generator loads the two halves of the incoming 8-byte
double with separate ldswa instructions instead of a single lddfa
instruction.

Figure 8: Comparison between PBIO and MPICH in structure and array exchange time.

Table 1: A comparison of PBIO and MPICH for homogeneous exchange of arrays

Data  alignment  is  generally  not  an  issue  in  storing 
to  the  native  record  because  it  is  presumably  aligned 
according  to  the  requirements  of  the  receiving  ma¬ 
chine.  We  also  assume  that  the  base  addresses  of 
the  incoming  and  native  records  are  strongly  aligned. 
This  leaves  the  offsets  of  the  incoming  record  fields  as 
the  primary  source  of  misalignment.  Since  these  are 
known  at  code  generation  time,  we  can  make  static 
decisions  about  using  efficient  direct  loads  for  aligned 
data  or  using  potentially  less  efficient  methods  for  un¬ 
aligned  data.3 


3  Our  current  code  generator  does  not  handle  misaligned 
accesses,  but  the  extension  to  handle  them  is  straightforward. 
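
The static decision described above can be sketched as a small planning function. This is a hypothetical extension, not the behavior of the current code generator (which, as noted in the footnote, does not handle misaligned accesses). It assumes power-of-two field sizes and a record base address aligned to at least the field size; the name plan_loads is an assumption.

```python
def plan_loads(offset, size):
    """Split a field at byte `offset` with length `size` (a power of two)
    into naturally aligned loads, returned as (offset, width) pairs.

    Because the record base is assumed to be strongly aligned, the
    alignment of each load depends only on the field offset, which is
    known when the conversion routine is generated.
    """
    width = size
    while offset % width != 0:
        width //= 2  # fall back to the next smaller load width
    return [(offset + i, width) for i in range(0, size, width)]

print(plan_loads(8, 8))  # aligned double: one 8-byte load
print(plan_loads(4, 8))  # double at offset 4: two 4-byte loads
```

For the record of Figure 7, the second call yields loads at offsets 4 and 8, the same two 4-byte accesses that the generated routine happens to use.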


3.5  Other  data  types  and  homogeneous 
systems 

The previous sections compared PBIO’s performance with that of MPICH in
situations involving a heterogeneous exchange of structures containing
mixed types. While PBIO shows clear and significant performance gains
over MPICH in that situation, MPICH might be expected to perform better
in dealing with messages consisting of contiguous arrays, or in a
homogeneous exchange where it might not use an XDR-based encoding
scheme.

Figure  8  shows  a  breakdown  of  MPICH  and  PBIO 
performance  for  heterogeneous  transmission  of  a 
100Kb  floating  point  array  and  compares  it  to  the 
previously  presented  breakdowns  for  the  100Kb  struc¬ 
ture.  This  figure  shows  that  PBIO’s  performance 
remains  essentially  unchanged  when  the  datatype  is 
changed  from  structure  to  an  array.  MPICH  perfor¬ 
mance  does  improve  with  contiguous  arrays,  but  not 
to  the  point  where  it  matches  PBIO’s  performance. 
The  results  for  smaller  datatypes  are  similar. 

A comparison of round-trip times for contiguous arrays between
homogeneous machines is shown in Table 1. This is one of the simplest
cases in binary communication, requiring no data conversion of any
kind. The send and receive side overheads are tiny compared to the time
required for network transmission, but PBIO retains a slight edge over
MPICH in both receive and send side overheads. These differences
largely account for the 10% or so better performance that PBIO achieves
in round-trip time for these contiguous arrays.

Figure 9: Receiver-side decoding costs with and without an unexpected field - heterogeneous case.

Figure 10: Receiver-side decoding costs with and without an unexpected field - homogeneous case.

That  PBIO  has  better  performance  than  MPICH 
even  in  situations  where  MPICH  might  be  expected 
to  prevail  is  convincing  evidence  that  PBIO’s  extra 
flexibility  in  supporting  application  evolution  does 
not  negatively  impact  performance  in  other  situa¬ 
tions.  The  next  section  will  examine  PBIO’s  per¬ 
formance  in  the  presence  of  application  evolution. 

3.6  Performance  in  application  evolution 

The principal difference between PBIO and most other messaging
middleware is that PBIO messages carry format meta-information,
somewhat like an XML-style description of the message content. This
meta-information can be an incredibly useful tool in building and
deploying enterprise-level distributed systems because it 1) allows
generic components to operate upon data about which they have no
a priori knowledge, and 2) allows the evolution and extension of the
basic message formats used by an application without requiring
simultaneous upgrades to all application components. In other terms,
PBIO allows reflection and type extension. Both of these are valuable
features commonly associated with object systems.

PBIO supports reflection by allowing message formats to be inspected
before the message is received. Its support of type extension derives
from doing field matching between incoming and expected records by
name. Because of this, new fields can be added to messages without
disruption, because application components which don’t expect the new
fields will simply ignore them.
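
Field matching by name can be sketched as follows. This is an illustration of the idea only, not PBIO's API: the (name, offset, size) tuples, the function names, and the assumption that matched fields have equal sizes and need no byte swapping are all hypothetical simplifications.

```python
def build_copy_plan(wire_fields, expected_fields):
    """Match expected fields to incoming fields by name.

    Both arguments are lists of (name, offset, size) tuples. Returns
    (wire_offset, native_offset, size) triples. Incoming fields that
    the receiver does not expect never enter the plan, so they are
    ignored; matched fields are assumed to have equal sizes.
    """
    wire_by_name = {name: off for name, off, _size in wire_fields}
    return [(wire_by_name[name], native_off, size)
            for name, native_off, size in expected_fields
            if name in wire_by_name]

def apply_plan(plan, wire_bytes, native_size):
    """Copy each matched field to its native offset; bytes of expected
    fields that are missing from the wire record stay zero."""
    native = bytearray(native_size)
    for wire_off, native_off, size in plan:
        native[native_off:native_off + size] = wire_bytes[wire_off:wire_off + size]
    return bytes(native)

# The sender has added an unexpected 'flags' field in front.
wire = [("flags", 0, 4), ("ivalue", 4, 4), ("count", 8, 4)]
expected = [("ivalue", 0, 4), ("count", 4, 4)]
plan = build_copy_plan(wire, expected)
print(plan)
```

The plan is computed once per incoming format, so the per-record work is limited to the copies themselves.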

Most  systems  which  support  reflection  and  type 
extension  in  messaging,  such  as  systems  which  use 
XML  as  a  wire  format  or  which  marshal  objects  as 
messages,  suffer  prohibitively  poor  performance  com¬ 
pared  to  systems  such  as  MPI  which  have  no  such 
support.  Therefore,  it  is  interesting  to  examine  the 
effect  of  exploiting  these  features  upon  PBIO  perfor¬ 
mance.  In  particular,  we  measure  the  performance  ef¬ 
fect  of  type  extension  by  introducing  an  unexpected 
field  into  the  incoming  message  and  measuring  the 
change  in  receiver-side  processing. 

Figures 9 and 10 present receive-side processing costs for an exchange
of data with an unexpected field. These figures show values measured on
the Sparc side of heterogeneous and homogeneous exchanges,
respectively, using PBIO’s dynamic code generation facilities to create
conversion routines. It’s clear from Figure 9 that the extra field has
no effect upon the receive-side performance. Transmitting the extra
field would have added slightly to the network transmission time, but
otherwise the support of type extension adds no cost to this exchange.

Figure 10 shows the effect of the presence of an unexpected field in
the homogeneous case. Here, the overhead is potentially significant
because the homogeneous case normally imposes no conversion overhead in
PBIO. The presence of the unexpected field creates a layout mismatch
between the wire and native record formats, and as a result the
conversion routine must relocate the fields. As the figure shows, the
resulting overhead is non-negligible, but not as high as in the
heterogeneous case. For smaller record sizes, most of the cost of
receiving data is actually caused by the overhead of the kernel
select() call. The difference between the overheads for the matching
and extra-field cases is roughly comparable to the cost of a memcpy()
operation for the same amount of data.

The results shown in Figure 10 are actually based upon a worst-case
assumption, where an unexpected field appears before all expected
fields in the record, causing field offset mismatches in all expected
fields. In general, the overhead imposed by a mismatch varies
proportionally with the extent of the mismatch. An evolving application
might exploit this feature of PBIO by adding any additional fields at
the end of existing record formats. This would minimize the overhead
caused to application components which have not been updated.
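
The effect of field placement on the extent of the mismatch can be sketched with a small helper. The layouts below are hypothetical and assume packed fields with no padding; the function name relocated_fields is an assumption.

```python
def relocated_fields(old_layout, new_layout):
    """Return the names of fields in old_layout whose byte offset is
    different in new_layout. Each layout is a list of (name, size)
    pairs and offsets are assumed to be packed, without padding."""
    def offsets(layout):
        result, position = {}, 0
        for name, size in layout:
            result[name] = position
            position += size
        return result

    old, new = offsets(old_layout), offsets(new_layout)
    return [name for name in old if new.get(name) != old[name]]

old = [("a", 4), ("b", 8), ("c", 4)]
extra = ("x", 4)
print(relocated_fields(old, [extra] + old))  # worst case: every field moves
print(relocated_fields(old, old + [extra]))  # appended field: nothing moves
```

With the new field appended, components that have not been updated see their expected fields at unchanged offsets, which is the low-overhead case suggested above.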

4  Conclusions 

Current  distributed  applications  rely  heavily  on 
leveraging  the  computing  power  of  heterogeneous  net¬ 
works  of  computer  architectures.  The  PBIO  library  is 
a  valuable  addition  to  the  mechanisms  available  for 
handling  binary  data  interchange  among  these  het¬ 
erogeneous  distributed  systems.  PBIO  performs  ef¬ 
ficient  data  translations,  and  supports  simple,  trans¬ 
parent  system  evolution  of  distributed  applications, 
both  on  a  software  and  a  hardware  basis. 

Rather than relegating message packing and unpacking operations to the
communicating applications, thus requiring a priori agreement on these
data structures, PBIO efficiently layers and abstracts diversities in
computer architectures. Applications need only agree on data by name,
and previously exposed issues such as byte ordering, architecture
specifications, data type sizes, and compiler differences are no longer
a concern. Since PBIO uses dynamic code generation rather than data
interpretation, compiler optimizations are utilized without the
cumbersome limitations of static data structures.

Enterprise-scale distributed computing can be implemented and deployed
much more simply and efficiently using PBIO’s flexibility, not only
initially, but during the evolution of specific distributed components.
Data elements can be added incrementally to the basic message formats
of distributed applications without disrupting the operation of
existing application components.

The  measurements  in  this  paper  have  shown  that 
PBIO’s  flexibility  does  not  impact  its  performance. 
In  fact,  PBIO’s  performance  is  better  than  that  of  a 
popular  MPI  implementation  in  every  test  case,  and 
significantly better in heterogeneous exchanges. Performance gains of
up to 60% are largely due to:

•  virtually  eliminating  the  sender-side  encoding 
cost  by  transmitting  in  the  sender’s  native  for¬ 
mat,  and 

•  using  dynamic  code  generation  to  perform  data 
conversion  on  the  receiving  side. 

In  short,  PBIO  is  a  novel  messaging  middleware 
that  combines  significant  flexibility  improvements 
with  an  efficient  implementation  to  offer  distributed 
applications fast heterogeneous binary data interchange.

Abstract

A heuristic algorithm that maps data-processing tasks onto
heterogeneous resources (i.e., processors and links of various
capacities) is presented. The algorithm tries to achieve a good
throughput of the whole data-processing pipeline, taking both
parallelism (load balance) and communication volume (locality) into
account. It performs well under both compute-intensive and
communication-intensive conditions. When all tasks and processors are
of the same size and communication is negligible, it quickly
distributes the compute load over processors and finds the optimal
mapping. As communication becomes significant and emerges as a
bottleneck, it trades parallelism for a reduction of communication
traffic. Experimental results using a topology generator that models
the Internet show that it performs significantly better than
communication-ignorant schedulers.


1.  Introduction 

It is widely believed that future computing environments will consist
of geographically distributed compute and data resources connected with
diverse communication capacities, forming a so-called “computational
Grid” environment [10]. Computational elements range from desktops to
clusters [4, 5] to supercomputers, and links range from phone lines to
gigabit system area networks. Both CPU capacity and network
connectivity are improving at a rapid pace, but the recent trend
indicates that network bandwidth is increasing more rapidly than CPU
capacity. As a consequence, communication-intensive parallel jobs,
which we are currently able to run only on dedicated supercomputers or
clusters, are likely to be hosted by a collection of desktops in
laboratories or even homes. This brings the Grid beyond just an
aggregation of computational horsepower and enables a qualitatively
different use of it. On the other hand, it presents significant
resource management problems to all levels of parallel/distributed
software development.

One  of  the  fundamental  elements  of  such  resource  man¬ 
agement  problems  is,  given  an  application  that  consists  of 
many  communicating  tasks,  to  select  a  suitable  set  of  re¬ 
sources  and  map  its  tasks  appropriately.  To  obtain  a  robust 
performance  across  a  wide  range  of  resource  configurations, 
mapping  algorithms  must  trade  load  balancing  for  the  re¬ 
duction  of  communication,  and  vice  versa. 

In this paper, we present a graph-theoretic formulation of this general
problem and propose a heuristic algorithm for it. The algorithm takes
as input a task graph and a resource graph and outputs a mapping from
tasks to processors. A task graph models a data processing pipeline; a
task in a pipeline continuously receives data from adjacent tasks,
processes them, and sends processed data to other tasks. Weights of
nodes and edges represent the compute and communication requirements of
these tasks, respectively. A resource graph models processors and
links. Weights of nodes and edges represent their compute and
communication capacities, respectively. If too many tasks are assigned
to a processor or too much communication goes through a link, the
processor or the link becomes a bottleneck and determines the overall
throughput of the entire pipeline.

The key to achieving a good throughput is clustering of a task graph, a
process which recognizes highly connected components in a task graph. A
cluster in a task graph represents a set of tasks that communicate
intensively with each other. These tasks should be placed on a single
processor if the available communication bandwidth is low. Among
several graph clustering methods proposed in the literature
[9, 13, 28], we use a simplified version of the stochastic flow
injection method [29, 30] proposed by Yeh et al.

Under a simple condition in which tasks and processors are of a uniform
weight and communication is negligible, the algorithm is guaranteed to
quickly give the optimal solution, in which tasks are uniformly
distributed over processors. As communication becomes significant and
emerges as a bottleneck, it co-locates highly communicating tasks to
reduce communication traffic. We have implemented the algorithm in the
Python scripting language [16] and performed experiments using a
simplified version of an Internet topology generator [7, 12] to
generate a realistic resource graph. As we expected, our algorithm
significantly outperforms simpler, communication-ignorant algorithms
under communication-intensive conditions.

The rest of the paper is organized as follows. Section 2 gives a
practical motivating scenario that we envision will commonly occur in
emerging Grid applications. Section 3 is devoted to the problem
formulation and Section 4 describes our algorithm. Section 5 shows
experimental results. Section 6 discusses the relationship to other
work and Section 7 summarizes the paper and states future work.


2.  A  Practical  Scenario 


Consider an application which reads a large volume of data from a
geographically distributed source (storage server), processes it, and
displays the result on a desktop. An example of such an application is
SARA [22], in which the data is surface data of the earth. Emerging
distributed applications that use geographically distributed data, such
as digital libraries [1] and scientific data archives [6], will have
more or less this kind of structure.

Even in this fairly simple setting, one question that arises is where
the data should be processed. The best decision clearly depends on how
computationally expensive the processing is, how much data it reads
from the source and writes to the display, how computationally powerful
the desktop and the storage server are, and how much bandwidth we have
between these nodes. The decision is much more complex when we have a
more involved data processing pipeline and more available resources
such as parallel compute servers. Finally, the availability of all
these resources changes over time. For example, processing should be
done on the desktop when the storage server is highly loaded.

It can easily be seen that it is difficult and time-consuming, if not
impossible, for individual application developers to implement a
decision procedure that works across a wide range of resource
configurations, even in a very simple case like this.
Application-specific solutions, if any, would not generalize to even
more complex and dynamic cases, in which we have hundreds of tasks that
are created and terminated over time.


3.  Problem  Description 

3.1.  Preliminary  Definitions  and  Notations 

Resource  Graph  and  Task  Graph:  A  resource  graph  is 
a  weighted  graph  (both  nodes  and  edges  are  weighted).1  A 
node  of  a  resource  graph  represents  a  processor  and  an  edge 
a  link  between  a  pair  of  processors.  The  weight  of  a  node 
represents  the  processor’s  compute  capacity  (the  amount  of 
computation  that  can  be  performed  in  a  unit  time)  and  that 
of  an  edge  the  link’s  communication  capacity  (the  amount 
of  data  that  can  go  through  the  link  in  a  unit  time). 

A task graph is also a weighted graph. A node of a task graph
represents a task and an edge a continuous communication (stream)
between a pair of tasks. The weight of a node represents the task’s
compute requirement (the amount of computation that must be done for
this task to make a unit progress) and that of an edge the
communication requirement of the connected tasks (the amount of data
that must be communicated for these tasks to make a unit progress).

Note that a task graph is not a traditional dependence graph, in which
an edge s → t represents the fact that task t can start its computation
only after s has finished. Rather, our task graph models a data
processing pipeline, in which all tasks continuously receive pieces of
data, process them, and then send the processed data. A typical example
is a multimedia data processing pipeline such as Smart Kiosk [21, 20],
in which the natural unit of work is a frame. Typical tasks include
compression, decompression, color tracking, object detection, and so
on. The weight of a node is the amount of computation performed by the
task per single frame, whereas that of an edge is the size of the data
transferred per frame.

Unlike other formulations [14, 26], our model does not have an explicit
notion of parallelized tasks. That is, a single node of a task graph
can be mapped only on a single node of a resource graph. A parallelized
task can be to some extent modeled by many nodes that together
represent a single logical task.

Notations: Let $G$ be a weighted graph. $G_i$ is the weight of node $i$
and $G_{i,j}$ the weight of edge $i \to j$. $G^{i}$ is the weighted
graph isomorphic to $G$, in which the weight of node $i$ is one and
that of all other nodes/edges is zero. $G^{i,j}$ is the weighted graph
isomorphic to $G$, in which the weight of the edges along the path from
$i$ to $j$ is one and that of all other nodes/edges is zero (Figure 1).
If there are multiple paths between a pair of nodes, we fix one such
path.

Let $G$ and $H$ be isomorphic weighted graphs. We define $G + H$ as
node- and edge-wise addition of their weights. We similarly define
$G - H$ and $G / H$. Let $k$ be a scalar; $kG$ denotes a graph
isomorphic to $G$ whose weights are multiplied by $k$.

1  Graphs can either be directed or undirected, but the following
discussion assumes directed graphs.

Figure 1. A weighted graph $G$, $G^{i}$ and $G^{i,j}$.

Interpretation: As we mentioned earlier, a task graph models a set of
tasks each of which repeatedly receives data from other tasks, performs
some computation on them to produce other data, and sends the produced
data to other tasks. We make this more precise with the pseudo code for
a task $t$ in a task graph $G = (V, E)$, shown in Figure 2.

The progress rate of task t is determined by several factors. First, t
will experience a certain amount of wait time at the wait step, if
tasks that are sending data to t cannot produce data fast enough or the
bandwidth from these tasks to t is insufficient. Second, more
obviously, this task will spend some time at the compute step. Finally,
the time taken at the send step will be determined by the outgoing
bandwidth and by how fast the receiving tasks can consume data.

As will be made clear in the next section, our problem formulation
effectively makes the idealizing assumption that the time task t needs
for a unit of progress is determined by the maximum, rather than the
sum, of these three factors. For example, if the wait step in isolation
takes 5 time units, the compute step 3 time units, and the send step 2,
then unit_progress as a whole takes only 5 time units, rather than
5 + 3 + 2 = 10. This approximates a situation in which these three
phases interleave in an infinitely fine-grained manner; that is, the
compute phase begins processing data when a single bit of data appears
in the incoming stream, and the send phase sends data as soon as it is
produced.


/* G = (V, E).
   a unit work task t repeats. */
unit_progress(t)
{
    /* (1) wait */
    for s ∈ V s.t. s → t ∈ E {
        wait for G_{s,t} units (e.g., bytes) of data
        to arrive from s;
    }

    /* (2) compute */
    perform G_t units of computation upon the
    received data;

    /* (3) send */
    for u ∈ V s.t. t → u ∈ E {
        send G_{t,u} bytes of data to u;
    }
}

/* a task t simply repeats unit_progress forever */
task(t)
{
    while (1) {
        unit_progress(t);
    }
}


Figure  2.  Pseudo  code  for  task  t. 


3.2.  Formulation 

We are interested in the throughput (the number of work units completed
per unit time) of the system in equilibrium. A mapping from tasks to
processors determines the amount of computation each processor must
perform for all tasks to complete a unit of work; it is simply the sum
of the weights of the tasks mapped onto the processor in question. It
similarly determines the volume of data each link must transfer for all
tasks to make a unit of progress. By dividing the requirement at each
node (edge) by its corresponding computation (communication) capacity,
we obtain the time required to service the requested computation
(communication) at the node (edge). We call it the occupancy at the
node (edge). The maximum occupancy over the entire graph gives us the
time required for all tasks to make a unit of progress. The goal is to
make the maximum occupancy of the mapping as small as possible. Note
that the maximum occupancy is the inverse of the number of work units
finished per unit time. Thus, minimizing the maximum occupancy is
equivalent to maximizing the throughput.

A more formal description follows. Let $G = (V_G, E_G)$ be a task graph
and $P = (V_P, E_P)$ a processor graph. Let $m$ be a mapping from $V_G$
to $V_P$. We define the load graph of the mapping, denoted by
$L(G, P, m)$, as:

$$L(G, P, m) = \sum_{t \in V_G} G_t \, P^{m(t)} + \sum_{(s,t) \in E_G} G_{s,t} \, P^{m(s),m(t)}$$

That  is,  a  load  graph  is  a  graph  whose  weights  represent 
the  amount  of  computation  and  communication  required  (at 
each  node  and  edge)  to  unit-progress  all  tasks. 

The occupancy graph of the mapping, denoted by $O(G, P, m)$, is
obtained by simply dividing the load by the capacity at each node and
edge:

$$O(G, P, m) = L(G, P, m) / P$$

The goal is to find a mapping $m$ that minimizes $\max(O(G, P, m))$,
where $\max(X)$ is the maximum weight over nodes and edges in graph
$X$. Figure 3 shows an example of a load graph and an occupancy graph.
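
The definitions of the load graph and the occupancy graph can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: graphs are plain dictionaries, the function names are hypothetical, and the "fixed" path between two processors is taken to be a breadth-first shortest path.

```python
from collections import deque

def fixed_path(links, src, dst):
    """Return one fixed path from src to dst as a list of directed links.
    Breadth-first search is an assumed stand-in for 'fix one such path'."""
    if src == dst:
        return []
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for a, b in links:
            if a == node and b not in parent:
                parent[b] = a
                queue.append(b)
    if dst not in parent:
        raise ValueError(f"no path from {src} to {dst}")
    path, node = [], dst
    while parent[node] is not None:
        path.append((parent[node], node))
        node = parent[node]
    return path[::-1]

def max_occupancy(task_nodes, task_edges, proc_nodes, proc_links, m):
    """Maximum weight of the occupancy graph for mapping m (task -> processor)."""
    node_load = {p: 0.0 for p in proc_nodes}
    link_load = {l: 0.0 for l in proc_links}
    for t, weight in task_nodes.items():        # compute load per processor
        node_load[m[t]] += weight
    for (s, t), weight in task_edges.items():   # traffic along the fixed path
        for link in fixed_path(proc_links, m[s], m[t]):
            link_load[link] += weight
    occupancy = [node_load[p] / proc_nodes[p] for p in proc_nodes]
    occupancy += [link_load[l] / proc_links[l] for l in proc_links]
    return max(occupancy)

# Hypothetical instance: four unit tasks in a chain, two processors.
tasks = {"a": 1, "b": 1, "c": 1, "d": 1}
streams = {("a", "b"): 1, ("b", "c"): 0.5, ("c", "d"): 1}
procs = {"A": 2, "B": 2}
links = {("A", "B"): 1, ("B", "A"): 1}
clustered = {"a": "A", "b": "A", "c": "B", "d": "B"}
scattered = {"a": "A", "b": "B", "c": "A", "d": "B"}
print(max_occupancy(tasks, streams, procs, links, clustered))  # prints 1.0
print(max_occupancy(tasks, streams, procs, links, scattered))  # prints 2.0
```

Both mappings balance the compute load equally, but the scattered one routes two unit streams over the same link, so its maximum occupancy doubles and its throughput halves.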

Note that the above formulation effectively assumes that all tasks
progress at the same pace; when any of the tasks takes x time units to
make a unit progress, all the other tasks also take x. In other words,
resources are never used to make some tasks go faster than the others.
This is a practical assumption because, assuming finite communication
buffers, any pair of communicating tasks must progress at the same pace
in equilibrium. Consequently, for connected task graphs, tasks must
eventually match their paces with all the other tasks.

Finally, we state that this problem is NP-hard. We show that the
corresponding decision problem TASKMAP, which asks whether there exists
a mapping whose maximum occupancy is no greater than a specified limit,
is NP-hard. There are several NP-hard problems that straightforwardly
reduce to TASKMAP. A reduction from the Knapsack problem is
particularly simple; we however use a reduction from the two-way graph
partitioning problem, which is also NP-hard [19], because we believe it
better illustrates the difficulty of the problem (in particular, it
also shows that the problem remains NP-hard even if we restrict all
tasks to be the same size). The graph partitioning problem, PARTITION,
takes an (unweighted) graph $G = (V, E)$ and an integer $c$ as input,
and asks if there is a partition $V = V_1 + V_2$ such that $V_1$ and
$V_2$ are disjoint and of equal size (i.e., $V_1 \cap V_2 = \emptyset$
and $|V_1| = |V_2| = |V|/2$) and the number of edges between $V_1$ and
$V_2$ is $\le c$.

Figure 3. Load graph and occupancy graph.

Theorem  1  TASKMAP  is  NP-hard. 

Proof: For a given instance of PARTITION, $G = (V, E)$ and $c$, we construct an instance of TASKMAP as follows.

• The task graph is a graph isomorphic to $G$, whose node weights and edge weights are all one.

• The resource graph is a graph of two nodes, whose weights are both $|V|/2$, and the weight of the edge between the two is $c$.

• The maximum occupancy is one. That is, we ask if there is a mapping whose maximum occupancy is no greater than 1.

It is easily seen that such a mapping exists if and only if the original graph partitioning problem has a solution: a node occupancy of at most 1 forces each of the two processors to receive exactly $|V|/2$ tasks, and an edge occupancy of at most 1 forces at most $c$ edges of the task graph to cross between the two processors. The reduction can be performed in polynomial time (Q.E.D.).
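
The construction in the proof can be sketched in a few lines of Python. This is only an illustration: the dictionary layout, the processor names `p1` and `p2`, and the brute-force solver are assumptions made for the example, and occupancy is taken to be load divided by capacity for both nodes and the single edge.

```python
from itertools import combinations


def partition_to_taskmap(vertices, edges, c):
    """Build the TASKMAP instance of Theorem 1 from a PARTITION instance."""
    task_graph = {
        "node_weight": {v: 1 for v in vertices},          # all tasks weigh one
        "edge_weight": {frozenset(e): 1 for e in edges},  # all edges weigh one
    }
    half = len(vertices) / 2
    resource_graph = {
        "node_capacity": {"p1": half, "p2": half},  # two processors of weight |V|/2
        "edge_capacity": c,                         # the single link weighs c
    }
    return task_graph, resource_graph, 1            # occupancy limit is one


def max_occupancy(task_graph, resource_graph, mapping):
    """Largest load/capacity ratio over both processors and the link."""
    load = {p: 0 for p in resource_graph["node_capacity"]}
    for task, proc in mapping.items():
        load[proc] += task_graph["node_weight"][task]
    node_occ = max(load[p] / cap
                   for p, cap in resource_graph["node_capacity"].items())
    # Traffic on the link is the total weight of edges whose ends are split.
    cut = sum(w for e, w in task_graph["edge_weight"].items()
              if len({mapping[v] for v in e}) == 2)
    cap = resource_graph["edge_capacity"]
    edge_occ = cut / cap if cap > 0 else (0.0 if cut == 0 else float("inf"))
    return max(node_occ, edge_occ)


def taskmap_brute_force(task_graph, resource_graph, limit):
    """Exhaustive search (exponential); returns a feasible mapping or None."""
    tasks = sorted(task_graph["node_weight"])
    for k in range(len(tasks) + 1):
        for on_p1 in combinations(tasks, k):
            mapping = {t: ("p1" if t in on_p1 else "p2") for t in tasks}
            if max_occupancy(task_graph, resource_graph, mapping) <= limit:
                return mapping
    return None
```

For a 4-cycle, every balanced partition cuts at least two edges, so the constructed instance is feasible for c = 2 and infeasible for c = 1.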
4. The Algorithm

4.1.  Motivating  Example 

If tasks are very compute-bound (communication is almost negligible), mapping is relatively straightforward, at least when task sizes are fairly uniform. It simply amounts to assigning each processor a total task weight roughly proportional to its compute capacity. Our main contribution concerns cases where tasks are more communication intensive, so that such communication-ignorant mappings result in excess traffic that limits performance. As the communication intensity of tasks increases, it becomes likely that mapping tasks that communicate intensively with each other onto the same processor results in significantly better performance.

As is the case in most combinatorial problems, the fundamental difficulty in achieving such mappings lies in the fact that the performance as a function of the mapping is quite discontinuous and has many local optima; the desired mapping is quite different from one communication intensity to another, and mappings that are in some sense 'between' these desired mappings are typically worse than both. Therefore it is difficult to move from one desired mapping to another by a series of greedy moves. To illustrate this, consider the task graph shown in Figure 4, where all nodes weigh one and all edges weigh $c$ (a parameter). When $c$ is very small, the desired mapping will typically be the one in which each processor has a single task (assuming a sufficient number of equally powerful processors). As $c$ increases beyond a certain threshold, the best mapping will typically become the one in which each processor is assigned a single cluster of tasks (as easily perceived by humans). Everything between these two extremes (for example, mappings in which each processor has two tasks) is typically worse than both. This is because, when compared to the first extreme (one task per processor), the amount of traffic a single processor sends or receives increases, so the communication bottleneck is not reduced. The communication bottleneck can be eliminated only by moving all tasks of a cluster to a single processor. This property prohibits the use of a simple local search strategy that tries to find a task $t$ and a processor $p$ such that moving $t$ to $p$ improves the objective function. It is quite unlikely that a series of such moves eventually reaches the desired extreme, whichever is better.
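
The effect can be checked numerically on a small hypothetical instance. The sketch below does not reproduce Figure 4; it assumes clusters that are cliques of four unit-weight tasks with edge weight c, no traffic between clusters, unit-capacity processors, and one unit-capacity link per processor, and it counts each undirected edge once toward the traffic of a processor's link.

```python
def clique_mapping_occupancy(cluster_size, tasks_per_proc, c,
                             link_capacity=1.0, proc_capacity=1.0):
    """Max occupancy when every processor hosts `tasks_per_proc` tasks
    taken from a single clique of `cluster_size` tasks (assumed setting)."""
    if cluster_size % tasks_per_proc != 0:
        raise ValueError("tasks_per_proc must divide cluster_size")
    node_occ = tasks_per_proc / proc_capacity
    # Each local task talks to every clique member hosted elsewhere.
    members_elsewhere = cluster_size - tasks_per_proc
    link_occ = tasks_per_proc * members_elsewhere * c / link_capacity
    return max(node_occ, link_occ)


for c in (0.1, 2.0):
    print(c, [clique_mapping_occupancy(4, k, c) for k in (1, 2, 4)])
```

With c = 0.1 the one-task-per-processor mapping is best (occupancy 1 versus 2 and 4); with c = 2 the one-cluster-per-processor mapping is best (4 versus 6 and 8), and the intermediate mapping is the worst of the three because its per-processor traffic grows from 3c to 4c.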

4.2.  Overall  Structure 

As is easily seen from the example just discussed, the key to achieving a good mapping is to recognize highly-connected clusters and to use this clustering information to guide the mapping process. Our basic approach is to linearly order tasks in such a way that tasks within a cluster are close to each other, and to put tasks on processors according to this order (as indicated by the labels in the figure). If a single processor is assigned multiple tasks, they are likely to be in the same cluster; therefore, when the tasks turn out to be communication-bound, processors can reduce communication simply by accommodating more tasks from the list.

Figure 4. A graph with highly-connected subgraphs.

To continue the above example, we first pick a processor and move tasks to it from the list. As tasks are ordered as shown in the figure, we exclusively choose tasks from one cluster (labeled A) in the beginning. The remaining problem is when we should stop this process and go on to the next processor. The best answer again depends on communication intensity; when $c$ is small, it is typically when the compute load is best balanced among processors, and otherwise when one or more whole clusters have just been moved. Details are given in Section 4.4.

Our entire algorithm first obtains the appropriate order of tasks based on a simplified version of the stochastic flow injection method proposed in [29, 30]. Given this information, it obtains an initial mapping and then improves it step by step. The elementary procedure mentioned above is used both to obtain the initial mapping and to improve it. The top-level structure of the algorithm is illustrated in Figure 5.

In the following sections, we describe the clustering algorithm that obtains the order of tasks (Section 4.3), the elementary procedure that moves tasks from the list to a processor (Section 4.4), and how to improve a mapping once obtained (Section 4.5). Throughout these sections, $G = (V_G, E_G)$ and $P = (V_P, E_P)$ refer to the given task graph and the resource graph, respectively. As a convention, we do not update data structures in place (we always rebind a variable to signify an update). A variable assigned in one iteration of a loop and used in the next is subscripted by a loop index, even though it is a single variable in the real program.

    /* G = (V_G, E_G) : task graph.
       P = (V_P, E_P) : resource graph. */
    taskmap()
    {
        <_G = clustering(G);      /* Section 4.3 */
        m = { };                  /* empty map */
        m = map_tasks(m);         /* Section 4.4 */
        repeat {
            m' = m;
            m = improve(m');      /* Section 4.5 */
        } while (O(G, P, m) < O(G, P, m'))
    }

Figure 5. The overall structure of the algorithm.

Finally, we made various simplifications for the purpose of presentation. For example, the algorithm as described calculates $L(G, P, m)$ many times, with $m$'s that differ only slightly from each other. The actual program keeps track of $L(G, P, m)$ all the time and incrementally updates it as $m$ changes. Practical optimizations of this kind are not made explicit in the description.
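
The top-level loop of Figure 5 can be sketched as a small Python driver. The four callables are placeholders for the procedures of Sections 4.3 to 4.5 and their signatures are assumptions made for this example; returning the last mapping that was not worsened is likewise an assumption, since the figure does not state which mapping is reported.

```python
def taskmap(order_tasks, map_tasks, improve, occupancy):
    """Driver mirroring Figure 5; all four arguments are hypothetical hooks.

    order_tasks()          -> linear task order       (Section 4.3)
    map_tasks(m, order)    -> total mapping from m    (Section 4.4)
    improve(m, order)      -> candidate mapping       (Section 4.5)
    occupancy(m)           -> maximum occupancy of m
    """
    order = order_tasks()
    best = map_tasks({}, order)           # initial mapping from the empty map
    while True:
        candidate = improve(best, order)
        if occupancy(candidate) < occupancy(best):
            best = candidate              # keep iterating while we improve
        else:
            return best                   # no further improvement
```

The loop stops at the first iteration that fails to lower the maximum occupancy, which matches the termination condition of the figure.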


     1: clustering(H)
        {
            T = recursive_clustering(H);
            <_H = depth-first traversal order of T;
     5:     return <_H;
        }

        recursive_clustering(H)
            H = (V, E) /* a subgraph of the task graph */
    10: {
            if (V is singleton (= {v})) {
                return leaf(v)
            } else {
                H_1, ..., H_n = clusters obtained by
    15:             stochastic flow injection (see text);
                return node(recursive_clustering(H_1),
                            ..., recursive_clustering(H_n));
            }
        }

Figure 6. Clustering Task Graphs.

4.3. Clustering Task Graph

The clustering algorithm is shown in Figure 6. Given a graph $H$, it first creates a tree that hierarchically decomposes the task graph into clusters (line 3). The root of the tree represents the entire set of nodes, whereas a leaf represents a singleton set containing one node. The children of a node are the parts of a partition of the parent node, obtained by a simplified stochastic flow injection method as described below. Once such a tree is obtained, we determine a total order between nodes, $<_H$, simply by traversing the tree in depth-first order (line 4).

Stochastic flow injection was originally proposed for VLSI circuit partitioning and works as follows:

1. Randomly pick two nodes $s$ and $t$ of the given graph.

2. Find the shortest path between $s$ and $t$.

3. Decrement the weights of all the edges on the path by a (small) constant $\Delta$ (i.e., inject a flow $\Delta$ between $s$ and $t$).

4. Remove edges whose weight becomes zero or negative.

5. Repeat 1-4 until the graph becomes disconnected.

6. When the graph is disconnected, each connected component is a cluster.


The intuition is that if only a small number of edges bridge two (or more) large clusters, such edges are likely to be decremented often, and the graph soon becomes disconnected at these edges.
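
A minimal Python sketch of one level of this procedure follows. Several details are assumptions made for the example rather than taken from the text: the shortest path is the fewest-hop path found by breadth-first search, `delta` defaults to 0.1, node weights (when given) bias the choice of endpoints, and the function and parameter names are hypothetical.

```python
import random
from collections import deque


def shortest_path(adj, s, t):
    """Fewest-hop path from s to t as a list of nodes, or None."""
    parent = {s: None}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None


def components(adj):
    """Connected components of the graph as a list of node sets."""
    seen, result = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend(adj[u])
        seen |= comp
        result.append(comp)
    return result


def flow_injection_clusters(nodes, edges, delta=0.1, node_weight=None, rng=None):
    """edges: {(u, v): weight} for an undirected graph; returns node sets."""
    rng = rng or random.Random()
    nodes = list(nodes)
    weights = [1.0 if node_weight is None else node_weight[v] for v in nodes]
    adj = {v: {} for v in nodes}
    for (u, v), w in edges.items():
        adj[u][v] = w
        adj[v][u] = w
    comps = components(adj)
    while len(comps) == 1 and len(nodes) > 1:
        s, t = rng.choices(nodes, weights=weights, k=2)   # step 1
        if s == t:
            continue
        path = shortest_path(adj, s, t)                   # step 2
        for u, v in zip(path, path[1:]):                  # step 3
            adj[u][v] -= delta
            adj[v][u] -= delta
            if adj[u][v] <= 0:                            # step 4
                del adj[u][v]
                del adj[v][u]
        comps = components(adj)                           # steps 5 and 6
    return comps
```

On two triangles with heavy edges joined by one light bridge, the bridge is removed the first time a path crosses it, so the two triangles come out as the clusters.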

In the original stochastic flow injection method, another phase follows to merge some of the clusters thus obtained, but we simply skip this phase, because our purpose is to recursively decompose clusters until each cluster becomes a singleton. We also slightly modified step 1 above, so that a task is chosen with a probability proportional to its weight; this is necessary because the original stochastic flow injection method assumes uniform weights (as is the case in its original application).
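
The recursive decomposition and the depth-first ordering of Figure 6 can be sketched as follows. The tree is represented with nested Python lists, and `split` stands for any routine that partitions a node list into two or more clusters; the `split_by_prefix` helper is a hypothetical, deterministic stand-in for stochastic flow injection used only to make the example reproducible.

```python
def recursive_clustering(nodes, split):
    """Return a cluster tree: a leaf is a task id, an inner node is a list."""
    nodes = list(nodes)
    if len(nodes) == 1:
        return nodes[0]
    parts = split(nodes)
    if len(parts) < 2:
        raise ValueError("split must return at least two clusters")
    return [recursive_clustering(part, split) for part in parts]


def depth_first_order(tree):
    """Linear task order obtained by a depth-first traversal of the tree."""
    if not isinstance(tree, list):
        return [tree]
    order = []
    for child in tree:
        order.extend(depth_first_order(child))
    return order


def split_by_prefix(nodes):
    """Stand-in splitter: group by first letter, else cut the list in half."""
    groups = {}
    for n in nodes:
        groups.setdefault(n[0], []).append(n)
    if len(groups) > 1:
        return list(groups.values())
    return [nodes[:len(nodes) // 2], nodes[len(nodes) // 2:]]


tree = recursive_clustering(["a1", "b1", "a2", "b2"], split_by_prefix)
print(depth_first_order(tree))
```

Tasks of the same cluster end up adjacent in the resulting order, which is the property the mapping procedure relies on.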

4.4.  The  Elementary  Move 

Procedure map_tasks, shown in Figure 7, takes as a parameter $m$, a partial mapping from tasks to processors (it is partial because some tasks are not mapped). It maps the tasks not mapped in $m$ onto $V_P$ by simply making a series of calls to a more elementary procedure, map_tasks_on, which maps some tasks to a specified processor.

     1: map_tasks(m)
        {
            Q = V_P;
            while (Q ≠ {}) {
                q = a processor ∈ Q;
     5:         Q = Q − {q};
                m = map_tasks_on(m, q, Q);
            }
            return m;
        }
    10: /* move some of the tasks not mapped in m to
           processor q, taking open communication and
           the balance between {q} and Q into account */
        map_tasks_on(m, q, Q)
        {
    15:     U_{m_0} = { t | t not mapped in m };
            for i = 1, ..., |U_{m_0}| {
                t = the minimum task ∈ U_{m_{i−1}} w.r.t. <_G;
                m_i = m_{i−1}[t/q]; /* add mapping t → q */
                U_{m_i} = U_{m_{i−1}} − {t};
    20:         O = O(G, P, m_i);
                O_comp = O_comp(G, P, m_i, Q);
                /* O_comp = ∞ if Q = {} */
                O_→ = O_→(G, P, m_i, U_{m_i}, q);
                O_← = O_←(G, P, m_i, q, U_{m_i});
    25:         M_i = max(O, O_comp, O_→, O_←);
            }
            find i that gave minimum M_i (i = 1, ..., |U_{m_0}|);
            break ties by selecting largest i.
            return m_i;
    30: }

Figure 7. The elementary move operation.

The procedure map_tasks_on takes three parameters, $m$, $q$, and $Q$: $m$ is a partial mapping from tasks to processors, $q$ is a processor in $V_P$ onto which some tasks are going to be mapped by the procedure, and $Q$ is the subset of $V_P$ ($q \notin Q$) that is as yet unused. The goal is to put an appropriate number of tasks on $q$, so that we are likely to reach a good final mapping if the remaining tasks are mapped on $Q$. As mentioned earlier, it puts tasks on $q$ one after another in the order obtained by the clustering; as we add more tasks to $q$, we obtain a series of mappings $m_0 = m$, $m_1 = m_0[t_1/q]$, $m_2 = m_1[t_2/q]$, $\ldots$, $m_n = m_{n-1}[t_n/q]$,² where $t_1 <_G t_2 <_G \cdots <_G t_n$ and $m_n$ is the total mapping from $V_G$ to $V_P$. So the only question is which $m_i$ we should choose.

Let $U_m$ denote the set of tasks that are not mapped in $m$. At each step, we keep track of the following four (three in the case of undirected graphs) values to evaluate the situation.

• (Line 20): The current occupancy $O(G, P, m_i)$.

• (Line 21): A hypothetical occupancy $O_{comp}$. $O_{comp}(G, P, m_i, Q)$ is an occupancy estimated by assuming that the tasks in $U_{m_i}$ are perfectly mapped on $Q$, ignoring communication. That is, it is simply the total compute requirement of these tasks over the total compute capacity of $Q$:

\[ O_{comp}(G, P, m, Q) = \frac{\sum_{t \in U_m} G_t}{\sum_{p \in Q} P_p} \]

Here $G_t$ is the compute requirement of task $t$ and $P_p$ is the compute capacity of processor $p$. For convenience we define this to be $\infty$ when $Q = \{\}$.


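• (Lines 23 and 24): Hypothetical occupancies $O_{\rightarrow}(G, P, m_i, U_{m_i}, q)$ and $O_{\leftarrow}(G, P, m_i, q, U_{m_i})$, which we call occupancies induced by open communication. Given a set of tasks $T$ and a processor $q$, we define open communication from $T$ to $q$ (from $q$ to $T$) to be the total communication volume from tasks in $T$ to tasks on $q$ (from tasks on $q$ to tasks in $T$). $O_{\rightarrow}(G, P, m, T, q)$ refers to open communication from $T$ to $q$ divided by the total edge capacity adjacent to $q$. Similarly for $O_{\leftarrow}$. That is:

\[ O_{\rightarrow}(G, P, m, T, q) = \frac{\sum_{s \in T,\; m(t) = q} G_{s,t}}{\sum_{(q,r) \in E_P} P_{q,r}} \]

and

\[ O_{\leftarrow}(G, P, m, q, T) = \frac{\sum_{m(s) = q,\; t \in T} G_{s,t}}{\sum_{(q,r) \in E_P} P_{q,r}} \]

where $G_{s,t}$ is the communication volume from task $s$ to task $t$ and $P_{q,r}$ is the capacity of the resource-graph edge between $q$ and $r$. When graphs are undirected, these two give the same value and are collectively referred to as $O_{\leftrightarrow}$.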


At each step, we calculate the above four (or three in the undirected case) values and record the maximum of them ($M_i$ at line 25). The procedure returns the $m_i$ that minimizes $M_i$ (lines 27-29).

² $m' = m[t/q]$ is an extension of $m$ such that $m'(t) = q$ and $m'(x) = m(x)$ for $x \neq t$.
The first item is intuitive. The second one, $O_{comp}$, tries to estimate what the final occupancy is going to be. Given this estimate, we determine how many tasks should be accommodated on the current processor $q$. For example, suppose the compute capacity of $q$ is 1, the total compute capacity of $Q$ is 99, and the total compute requirement of the tasks yet to be mapped is 1,000. Ideally, we would like to obtain a mapping whose maximum occupancy is close to 1,000/(1 + 99) = 10. Put differently, when we compare a series of mappings $m_1, m_2, \ldots$, any mapping whose occupancy is below 10 is equally good; there is no point in stopping at $m_i$ when the occupancy of $m_{i+1}$ is still below 10.
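
The arithmetic of this example can be replayed with a short Python sketch. The function name and argument layout are assumptions made for illustration, and the 1,000 units of work are modeled as unit-weight tasks.

```python
import math


def o_comp(unmapped_requirements, unused_capacities):
    """Total compute requirement of unmapped tasks over the total capacity
    of the unused processors Q; infinity when Q is empty."""
    if not unused_capacities:
        return math.inf
    return sum(unmapped_requirements) / sum(unused_capacities)


def first_processor_metric(i):
    """max(O, O_comp) after i unit tasks are placed on q (capacity 1),
    with Q of total capacity 99 and 1,000 unit tasks overall."""
    return max(i / 1.0, o_comp([1] * (1000 - i), [99]))


best_i = min(range(1, 21), key=first_processor_metric)
print(best_i, first_processor_metric(best_i))
```

The two curves cross at i = 10, where both the occupancy of q and the estimate for the rest equal 10, in agreement with the ideal value 1,000/(1 + 99).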

The third item, $O_{\rightarrow}$ ($O_{\leftarrow}$), or the open communication metric, serves to identify an $m_i$ at which the communication volume between the tasks already mapped on $q$ and those that are not yet mapped is small. Keeping track of such communication is necessary because it is not taken into account by $O(G, P, m_i)$, which only counts tasks mapped in $m_i$. This guides the mapping process by giving advice of the following kind: "rather than choosing an $m_5$ at which open communication is so large, accommodate more tasks and choose $m_8$, at which the processor is more loaded, but communication traffic is much smaller." Accurate estimation clearly requires not only the communication volume, but also the link bandwidth from $q$ to the processors that accommodate the other tasks. An obvious problem is that we do not yet know how the remaining tasks will be mapped, so we do not know precisely what the occupancy of these links will be. We simply estimate it by (1) calculating the total communication volume between the tasks on $q$ and the remaining tasks, and (2) dividing it by the total link capacity adjacent to $q$. This effectively assumes that such communication will be routed evenly across all adjacent links and that internal links (those not adjacent to a processor) will not be a bottleneck. These assumptions, the first one in particular, may be optimistic, and the estimate needs to be more sophisticated when $q$ has multiple adjacent links. In our experiments, a processor is adjacent only to a single link, so this is not an issue.
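
The two-step estimate just described can be sketched in Python. The representation of communication as a dictionary keyed by directed task pairs and the function name are assumptions made for the example.

```python
def open_communication(comm, tasks_on_q, remaining, adjacent_capacities):
    """Return (O_to_q, O_from_q): open communication between the tasks
    already on q and the unmapped tasks, divided by the total capacity of
    the links adjacent to q.

    comm: {(s, t): volume sent from task s to task t}
    """
    capacity = sum(adjacent_capacities)
    to_q = sum(v for (s, t), v in comm.items()
               if s in remaining and t in tasks_on_q)
    from_q = sum(v for (s, t), v in comm.items()
                 if s in tasks_on_q and t in remaining)
    return to_q / capacity, from_q / capacity


comm = {("a", "b"): 4.0, ("b", "c"): 2.0, ("c", "a"): 1.0}
print(open_communication(comm, {"a", "b"}, {"c"}, [2.0]))
```

Traffic between two tasks that are both on q (here a to b) is not open and does not contribute to either value.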

To illustrate how the procedure works, let us look at a process that maps tasks to a processor as shown in Figure 8. We start from the empty mapping and add tasks to the left processor, in the order indicated by the numbers. Edges and nodes in the task graph weigh one. The edge of the resource graph weighs one and the two nodes weigh five. Figure 9 plots $O$, $O_{comp}$, and $O_{\leftrightarrow}$ (graphs are undirected) at every step. Observe that the open communication metric goes up and down and that $M_i$ (the maximum of the three values) is minimized at $m_8$, even though the compute load between the two processors is best balanced at $m_{11}$ (the point where the two graphs $O_{comp}$ and $O$ intersect). Therefore the procedure will choose to put 8 tasks on the left processor, which is optimal. If, on the other hand, the edges of the task graph weigh much less (say, 0.1), the graph of $O_{\leftrightarrow}$ will become much lower, giving the best $M_i$ at $m_{11}$. So in this case, the first processor will get 11 tasks, which is again optimal.

Figure 8. Example graph to illustrate map_tasks_on.

Figure 9. How $O$, $O_{comp}$, and $O_{\leftrightarrow}$ change as we put tasks on the left processor in Figure 8.

Note that in general, for the easy case where communication is negligible and tasks and processors have uniform weights (w.l.o.g. assume they weigh 1), the procedure map_tasks({}) is guaranteed to return the optimal mapping, in which no processor gets more than $\lceil N/P \rceil$ tasks, where $N$ is the number of tasks and $P$ the number of processors. To see this, consider what happens in the first call to map_tasks_on({}, $q$, $V_P - \{q\}$). As communication is negligible, it simply amounts to finding the intersection of the two graphs $O = i$ and $O_{comp} = (N - i)/(P - 1)$. Solving the equation $O = O_{comp}$ gives $i = N/P$, and thus the best value is obtained either at $\lceil N/P \rceil$ or at $\lceil N/P \rceil - 1$. We can repeat this argument to show that this is the case for the other processors. This property ensures that our mapping procedure quickly gives a good solution for compute-mostly jobs without iterating improvements.
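
The whole elementary move can be sketched in Python under simplifying assumptions that are not part of the paper's general model: every processor has exactly one adjacent link, internal links are never a bottleneck, communication is undirected (so a single open communication value is used), and the occupancy of a partial mapping is the maximum over processors and their links. All names and data layouts are hypothetical.

```python
import math


def occupancy(m, weight, comm, capacity, link_capacity):
    """Max occupancy of partial mapping m over processors and their links."""
    load = {p: 0.0 for p in capacity}
    traffic = {p: 0.0 for p in capacity}
    for t, p in m.items():
        load[p] += weight[t]
    for (s, t), v in comm.items():
        if s in m and t in m and m[s] != m[t]:
            traffic[m[s]] += v        # crosses the link of both endpoints
            traffic[m[t]] += v
    return max(max(load[p] / capacity[p], traffic[p] / link_capacity[p])
               for p in capacity)


def map_tasks_on(m, q, Q, order, weight, comm, capacity, link_capacity):
    """Put a prefix of the unmapped tasks (in `order`) on processor q."""
    unmapped = [t for t in order if t not in m]
    best, best_M, cur = m, math.inf, dict(m)
    for i, t in enumerate(unmapped, start=1):
        cur = {**cur, t: q}                      # m_i = m_{i-1}[t/q]
        rest = set(unmapped[i:])                 # tasks still unmapped in m_i
        O = occupancy(cur, weight, comm, capacity, link_capacity)
        O_comp = (sum(weight[u] for u in rest) / sum(capacity[p] for p in Q)
                  if Q else math.inf)
        open_volume = sum(v for (s, u), v in comm.items()
                          if (cur.get(s) == q and u in rest)
                          or (cur.get(u) == q and s in rest))
        M = max(O, O_comp, open_volume / link_capacity[q])
        if M <= best_M:                          # "<=" prefers the largest i
            best, best_M = cur, M
    return best


def map_tasks(m, order, weight, comm, capacity, link_capacity):
    """Visit processors one by one, as in Figure 7."""
    Q = list(capacity)
    while Q:
        q = Q.pop(0)
        m = map_tasks_on(m, q, Q, order, weight, comm, capacity, link_capacity)
    return m
```

With ten unit tasks, three unit processors, and no communication, this sketch assigns 3, 4, and 3 tasks, so no processor exceeds the bound of four. With two tightly connected triangles joined by a light edge, it stops exactly at the cluster boundary.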
Other Implementation Notes: The actual implementation of the procedure is a bit more sophisticated, in order to avoid useless computation.

• map_tasks_on quits as soon as $O(G, P, m_i)$ becomes greater than any of the $M_j$ ($j < i$). Since $O(G, P, m_i)$ is monotonically non-decreasing with respect to $i$, once this condition is observed, we have:

$M_k \ge O(G, P, m_k) \ge O(G, P, m_i) > M_j$ for all $k \ge i$.

Thus there is no chance that we will observe a better $M_k$ in the future. Again, this guarantees that in the easy case mentioned above, map_tasks_on quits as soon as it puts $\lceil N/P \rceil + 1$ tasks on a processor.

• Both map_tasks and map_tasks_on optionally take one more parameter, $u$, which specifies an occupancy bound that the resulting mapping should achieve. map_tasks_on quits as soon as $O(G, P, m_i)$ becomes greater than this value, and map_tasks aborts the entire process as soon as $O_{comp}(G, P, m_i, Q)$ gets larger than $u$ in an iteration. This is useful when we already know a mapping and try to improve it. In such circumstances, we determine $u$ based on the current occupancy (e.g., $u$ = the current occupancy × 0.9) and give it to map_tasks.

4.5.  Iterative  Improvement 

Procedure improve in Figure 10 tries to improve a given (total) mapping $m$ by first removing some tasks from $m$ (line 3) and then applying map_tasks to the partial mapping obtained this way. Obviously, the key is to identify a small subset of tasks whose removal gives us a good chance of improving the mapping. A naive selection algorithm could remove all the tasks from $m$, effectively applying map_tasks again from the empty mapping.

The selection algorithm works as follows.

1. First calculate the current maximum occupancy and multiply it by an acceleration factor (currently 0.75). We remove tasks until the resulting mapping gives a maximum occupancy below this target value (line 10).

2. We scan the nodes and edges of the resource graph, trying to find an edge or a node whose occupancy is greater than the target value.

3. If such a node is found, let $p$ be the node. Find tasks $s_i$ ($i = 1, 2, \ldots$) such that $s_i$ is mapped on $p$ and is not deleted yet. Among all such tasks, select the heaviest task and delete it.

4. If such an edge is found, let $l$ be the edge. Find pairs of tasks $(s_i, t_i)$ ($i = 1, 2, \ldots$) such that the route between $s_i$ and $t_i$ (on the processor graph) uses $l$ and either $s_i$ or $t_i$ is not deleted yet. Among all such pairs, select the most heavily communicating pair and delete it.

5. Repeat steps 2-4 until the occupancy becomes less than the target value computed at step 1.

     1: improve(m)
        {
            m = remove_bottlenecks(m);
            m = map_tasks(m);
     5:     return m;
        }

        remove_bottlenecks(m)
        {
    10:     o = 0.75 × max(O(G, P, m));
            D = {}; /* set of deleted mappings */
            while (max(O(G, P, m − D)) > o) {
                L = L(G, P, m − D);
                find if any p ∈ P and q ∈ P s.t. L_{p,q}/P_{p,q} > o;
    15:         if found {
                    select s, t ∈ V_G s.t. (s ∉ D or t ∉ D), the route between
                    m(s) and m(t) uses edge (p, q), and G_{s,t} is maximum;
                    D = D + {(s, m(s)), (t, m(t))};
                } else {
    20:             there must be p ∈ P s.t. L_p/P_p > o;
                    select s ∈ V_G s.t. s ∉ D,
                    m(s) = p, and G_s is maximum;
                    D = D + {(s, m(s))};
                }
    25:     }
            return m − D;
        }

Figure 10. The procedure to improve the current mapping.

It basically tries to identify a set of tasks that form bottlenecks, that is, the tasks that make the current occupancy so large. It finds an edge or a node in the resource graph whose occupancy is larger than the target value calculated from the current occupancy. If one is found, the tasks contributing to that edge or node are candidates.

While reasonable, this algorithm still has room for further improvements, which we have yet to experiment with. It does not pay attention to the communication induced between deleted tasks and undeleted tasks. If the communication between them is large, attempts to move those deleted tasks unavoidably induce a large amount of communication traffic and are likely to fail. Among the many ways to select candidate tasks, we would like to select a set of tasks that do not communicate intensively with the other tasks. If such a selection cannot be obtained, it makes sense to co-migrate some of the other tasks too, even if they do not constitute the bottleneck.
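
The selection procedure can be sketched in Python for a simplified setting. As assumptions made only for this example, each processor has a single adjacent link, internal links are ignored, communication is an undirected dictionary keyed by task pairs, and only pairs whose two tasks are both still mapped are considered when a link is overloaded (the text also admits pairs with one task already deleted). The names are hypothetical.

```python
def occupancies(m, weight, comm, capacity, link_capacity):
    """Per-processor node and link occupancies of a (partial) mapping m."""
    load = {p: 0.0 for p in capacity}
    traffic = {p: 0.0 for p in capacity}
    for t, p in m.items():
        load[p] += weight[t]
    for (s, t), v in comm.items():
        if s in m and t in m and m[s] != m[t]:
            traffic[m[s]] += v
            traffic[m[t]] += v
    node = {p: load[p] / capacity[p] for p in capacity}
    link = {p: traffic[p] / link_capacity[p] for p in capacity}
    return node, link


def remove_bottlenecks(m, weight, comm, capacity, link_capacity, factor=0.75):
    """Unmap tasks until the max occupancy drops to factor * current max."""
    node, link = occupancies(m, weight, comm, capacity, link_capacity)
    target = factor * max(list(node.values()) + list(link.values()))
    kept = dict(m)
    while True:
        node, link = occupancies(kept, weight, comm, capacity, link_capacity)
        if max(list(node.values()) + list(link.values())) <= target:
            return kept
        hot_links = [p for p in link if link[p] > target]
        if hot_links:
            p = hot_links[0]
            # Most heavily communicating pair routed over p's link.
            s, t = max(((a, b) for (a, b) in comm
                        if a in kept and b in kept and kept[a] != kept[b]
                        and p in (kept[a], kept[b])),
                       key=lambda pair: comm[pair])
            del kept[s], kept[t]
        else:
            p = next(x for x in node if node[x] > target)
            # Heaviest task still mapped on the overloaded processor.
            s = max((t for t in kept if kept[t] == p), key=lambda t: weight[t])
            del kept[s]
```

The returned partial mapping can then be passed to a map_tasks-style procedure, which re-places only the tasks that were removed.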

5.  Experiments 

5.1.  Graph  Generation 

We used a simplified version of the Internet topology model described in [7, 12] to generate resource graphs. While that model covers WANs, MANs, and LANs, we omit MANs for simplicity and model resource graphs by a two-level (WAN and LAN) hierarchy. Given a configuration that describes such parameters as the number of WAN nodes, the number of LANs, the number of nodes within a LAN, and the compute capacity of a processor, it generates a graph as follows.

• First generate the specified number of WAN nodes and randomly place them in a specified rectangle. Create edges between all pairs of nodes, associating with each edge a cost proportional to its length. Then compute the minimum spanning tree of the resulting complete graph.

• Generate the specified number of LANs. For each LAN, first create a gateway and randomly place it in the specified rectangle. Connect the gateway to its nearest WAN node. Then generate a randomly chosen number of nodes in the LAN. A LAN is modeled as a (shallow) tree whose root is connected to its gateway and in which each node has a randomly chosen number of children. The compute capacity within a single LAN is uniform and chosen randomly.
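
The two generation steps can be sketched in Python as follows. This is a reduced illustration under stated assumptions: every LAN is a depth-one tree, only the leaves of a LAN are treated as processors, node names such as `wan3` or `lan2.n5` are invented for the example, and the default bandwidths and ranges are the values listed in Table 1.

```python
import math
import random


def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph; returns (i, j) pairs."""
    in_tree, edges = {0}, []
    while len(in_tree) < len(points):
        i, j = min(((a, b) for a in in_tree
                    for b in range(len(points)) if b not in in_tree),
                   key=lambda e: math.dist(points[e[0]], points[e[1]]))
        in_tree.add(j)
        edges.append((i, j))
    return edges


def generate_resource_graph(n_wan, n_lan, width, height, rng,
                            children_range=(5, 20), capacity_range=(3.0, 15.0),
                            wan_bw=1000.0, gateway_bw=500.0, lan_bw=50.0):
    """Return ({(u, v): bandwidth}, {processor: compute capacity})."""
    wan = [(rng.uniform(0, width), rng.uniform(0, height)) for _ in range(n_wan)]
    edges = {(f"wan{i}", f"wan{j}"): wan_bw
             for i, j in minimum_spanning_tree(wan)}
    capacity = {}
    for k in range(n_lan):
        gateway = (rng.uniform(0, width), rng.uniform(0, height))
        nearest = min(range(n_wan), key=lambda i: math.dist(wan[i], gateway))
        edges[(f"gw{k}", f"wan{nearest}")] = gateway_bw
        root = f"lan{k}.root"
        edges[(root, f"gw{k}")] = lan_bw
        lan_capacity = rng.uniform(*capacity_range)   # uniform within one LAN
        for c in range(rng.randint(*children_range)):
            node = f"lan{k}.n{c}"
            edges[(node, root)] = lan_bw
            capacity[node] = lan_capacity
    return edges, capacity
```

Because the WAN part is a spanning tree and every LAN is a tree hanging off one WAN node, the generated resource graph is itself a tree, so the route between any two processors is unique.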


Figure 11. A typical resource graph used by the experiments. It was generated by a simplified Internet topology generator. (Plot title: Generated Network; horizontal axis: Horizontal Distance.)


Table 1 lists the relevant parameters and Figure 11 shows a typical graph generated by this model. A sector in the figure is a LAN, which has from 10 to 20 nodes. The depth of some sectors is one and that of others is two.

For task graphs, we generate a pipeline of parallel jobs for each run as follows.

1. Randomly choose the number of tasks in a parallel job ($m$), and create a complete graph of $m$ nodes. Nodes within a single parallel job are equally weighted and the weight is randomly chosen.

2. Repeat step 1 a randomly chosen number ($n$) of times to obtain $n$ complete graphs.

3. Connect these complete graphs to form a simple pipeline (without branches and merges). To connect two complete graphs $A$ and $B$, we simply form a complete bipartite graph (create an edge between every task in $A$ and every task in $B$). Each edge weighs $1.0/(a \times b)$, where $a$ and $b$ are the number of nodes in $A$ and $B$, respectively. Thus the total communication volume between two parallel jobs is always 1.0.
Resource Graph
    the number of WAN nodes                    10
    the number of LANs                         10
    bandwidth between WAN nodes                1000.0
    WAN ↔ LAN bandwidth                        500.0
    LAN bandwidth                              50.0
    the number of children for a LAN node      [5, 20]
    compute capacity of a processor            [3.0, 15.0]

Task Graph
    the number of clusters in a task graph     [5, 10]
    the number of tasks in a cluster           [5, 10]
    compute requirement of a task              [1.0, 3.0]
    (total) comm. between a pair of clusters   1.0
    communication intensity parameter          c (see text)

Table 1. Parameters used in the experiments. [a, b] means that a value is chosen randomly from [a, b] for each run.


Edges  within  a  single  parallel  job  are  equally  weighted  and 
the  weight  is  chosen  randomly  from  [1,  c],  where  c  is  a  pa¬ 
rameter  that  controls  the  communication  intensity  of  the 
tasks.  We  compare  performance  of  several  algorithms  for 
various  values  of  c. 
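
The task graph generation can be sketched in Python as follows. The ranges are those of Table 1; the representation (tasks as `(job, index)` tuples, communication as a dictionary of undirected pairs) and the function name are assumptions made for the example.

```python
import random


def generate_task_graph(c, rng, jobs_range=(5, 10), tasks_range=(5, 10),
                        weight_range=(1.0, 3.0)):
    """Return (node weights, edge weights, list of jobs) for one run."""
    weight, comm, jobs = {}, {}, []
    for j in range(rng.randint(*jobs_range)):
        tasks = [(j, k) for k in range(rng.randint(*tasks_range))]
        w = rng.uniform(*weight_range)   # equal node weights within a job
        e = rng.uniform(1.0, c)          # equal edge weights within a job
        for t in tasks:
            weight[t] = w
        for a in range(len(tasks)):      # complete graph inside the job
            for b in range(a + 1, len(tasks)):
                comm[(tasks[a], tasks[b])] = e
        jobs.append(tasks)
    for A, B in zip(jobs, jobs[1:]):     # simple pipeline of jobs
        for s in A:
            for t in B:
                comm[(s, t)] = 1.0 / (len(A) * len(B))
    return weight, comm, jobs
```

Summing the bipartite edge weights between two consecutive jobs gives 1.0 up to rounding, regardless of the job sizes.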

Let us perform a rough calculation to see how the communication intensity of tasks varies according to $c$. Since the compute requirement per task is from 1.0 to 3.0, and the capacity per processor is from 3.0 to 15.0, the occupancy of a processor ranges from 1.0/15.0 to 1.0, assuming a single processor accommodates a single task. When sufficiently many tasks are created, one of the processors is likely to get an occupancy close to 1.0. On the other hand, since the number of tasks in a parallel job is from 5 to 10, the communication volume per task is from $5c$ to $10c$ (ignoring inter-cluster communication, which is only a fraction). Considering the LAN bandwidth, which is 50.0, the occupancy of an edge adjacent to a processor is $0.1c$ to $0.2c$, again assuming a single task on a single processor. Comparing the expected node occupancy ($\approx 1.0$) with this value, clustering is unlikely to be necessary for $c \approx 1$. In this sense, for $c \approx 1$, tasks are hardly communication intensive. For $c \approx 16$, on the other hand, an edge occupancy will range from 1.6 to 3.2, much larger than the expected processor occupancy. Therefore when $c \approx 16$, a good solution is likely to use clustering.
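
The rough calculation can be replayed with a few lines of Python. The function name and parameter defaults are assumptions for illustration; the defaults are the Table 1 values, and the per-task communication volume uses the same approximation as the text (job size times c).

```python
def occupancy_ranges(c, task_req=(1.0, 3.0), proc_cap=(3.0, 15.0),
                     job_size=(5, 10), lan_bandwidth=50.0):
    """(node occupancy range, edge occupancy range) for one task per processor."""
    node = (task_req[0] / proc_cap[1], task_req[1] / proc_cap[0])
    edge = (job_size[0] * c / lan_bandwidth, job_size[1] * c / lan_bandwidth)
    return node, edge


for c in (1, 16):
    print(c, occupancy_ranges(c))
```

For c = 1 the edge occupancy (0.1 to 0.2) stays well below the largest node occupancy (1.0), while for c = 16 it reaches 1.6 to 3.2 and dominates.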

5.2.  Results 

We compare the following four algorithms for $c$ = 1, 2, 4, 8, 12, and 16.

Base: Do not use the open communication metric described in Section 4.4. Also do not perform the improvement phase described in Section 4.5.

Base + improve: Do not use the open communication metric. Apply the improvement phase after an initial mapping is obtained, again without the open communication metric.

Open: Use the open communication metric, but do not perform the improvement phase.

Open + improve: Use the open communication metric and apply the improvement phase to the initial mapping.

For each value of $c$, we generate 32 instances of the problem and run the four algorithms. For each instance and for each algorithm, we calculate the improvement of the occupancy relative to Base. The graphs in Figure 12 show the result. A dot corresponds to an instance and its value represents the relative improvement over Base. (Note that in Base + improve, the number of dots looks much smaller than 32. This is because the results are in many cases 1; i.e., no improvement is observed.) Figure 13 shows the average improvement over the 32 instances.

It is clear that taking open communication into account becomes significant as tasks become communication intensive. As we expected, all four algorithms perform equally well for $c \approx 1$. Adding the iterative improvements to Open improved performance slightly, but not very much. As we discussed in Section 4.5, our task selection algorithm is not very sophisticated yet, so we need more experiments to be conclusive.

6.  Related  Work 
6.1.  Task  Scheduling 

There are a number of studies on task scheduling in heterogeneous environments [8, 11, 15, 17, 18, 27]. To the author's knowledge, most of this work has focused on scheduling DAGs, in which a task graph represents dependencies between tasks. The DAG scheduling problem and the throughput optimization problem discussed in this paper are quite different, both in terms of the basic techniques employed and the target applications. In terms of techniques, most algorithms for DAG scheduling are more or less based on list scheduling, whereas the basic model of throughput optimization is graph partitioning. In terms of target applications, DAG scheduling applies to a set of many tasks that rarely communicate with each other, whereas the throughput optimization problem applies to tasks communicating via high-bandwidth streams. While both are important, we believe the throughput optimization problem discussed in this paper will become increasingly important for emerging multimedia and data-intensive applications in wide-area environments.
Figure 12. Improvements of the various
methods  over  the  Base  method  (Internet 
model). 


Figure  13.  Average  improvements. 


Several studies on scheduling with bandwidth metrics have been done.
Subhlok et al. [25, 26] studied optimal processor allocation for a set
of communicating data parallel tasks, with both latency and bandwidth
metrics. In their problem setting, the performance of a task is a
function of the number of processors allocated to that task and does
not depend on which processors are used. They make a similar
assumption about communication performance. The problem therefore
amounts to determining how many processors should be allocated to each
task. This effectively assumes two things. One is that processor speed
is uniform. The other is that link bandwidth is not only uniform but
also very high, so the locations of communicating tasks do not matter.
This is a good model for a system-area cluster, which is their target
environment, but is not directly applicable to multimedia and
data-intensive applications in wide-area environments.

Developing applications that exhibit robust performance over a wide
range of resource conditions has become an important issue. Several
frameworks have been proposed [3, 24], and many practical studies of
adaptive applications in heterogeneous environments have been
conducted [2, 23]. While such studies are certainly instructive, it is
difficult for individual programmers to perform them for every single
application. We believe that task mapping should be much more
automated.

6.2.  Graph  Partitioning 

Graph partitioning tries to cut a graph into two or more sub-graphs,
each of which is more densely connected than the entire graph. Our
problem shares a common difficulty with this basic problem: moving any
single node or exchanging any single pair of nodes is not likely to
improve the objective function.

Kernighan and Lin [13] dealt with the basic two-way partitioning
problem, which cuts the graph into two sub-graphs of exactly the same
size, and gave the basic idea for escaping local optima. Fiduccia and
Mattheyses [9] proposed a faster algorithm for a slightly different
problem, in which a certain amount of difference between the sizes of
the two sub-graphs is accepted. Wei et al. further proposed a ratio
cut [28], which automatically achieves a balance between a low cut
size and a good ratio of the sub-graph sizes. Finally, Yeh et al.
proposed multi-way partitioning based on a stochastic flow injection
method [29, 30].

While our current algorithm can in principle use any good partitioning
algorithm to preprocess a task graph, one property of Yeh's method is
particularly attractive for our purpose: it not only finds highly
connected components in a graph, but also detects the (negative) fact
that no more natural clusters exist in a graph, in which case it
typically divides the graph into many singletons. With only a two-way
partitioning algorithm, we would have to apply it recursively, which
is computationally expensive and does not improve quality.

7.  Summary  and  Future  Work 

We have presented a heuristic algorithm for a task mapping problem
that takes compute and bandwidth requirements into account. The key to
achieving good performance is clustering, a process that recognizes
intensively communicating tasks. We use this clustering information to
obtain the order in which tasks should be put on processors. An open
communication metric was introduced to decide how many tasks should be
put on a processor. The algorithm is able to incrementally improve a
given mapping, moving only those tasks that form the bottleneck. It
can therefore efficiently fix a significant load imbalance caused by a
small number of tasks. The experimental results matched our
expectations, indicating that our communication-sensitive algorithm
significantly outperforms simpler, communication-ignorant algorithms
for communication-intensive jobs.

We are planning to enhance this work in several ways. First, we are
going to improve the task selection algorithm for incremental
improvements, so that it moves clusters that do not intensively
communicate with the rest of the tasks. Second, we will analyze the
computational complexity of the algorithm in detail. Third, we will
try to identify other cases where this algorithm is guaranteed to
produce a result within a constant factor of the optimal. Practical
goals include developing a system that automatically selects resources
and maps tasks in wide-area environments, which would help Grid
application designers develop performance-portable Grid code. We hope
this work serves as a sound, logical step toward achieving this goal.
Abstract

Applications that use collections of very large, distributed datasets
have become an increasingly important part of science and engineering.
With high performance wide-area networks becoming more pervasive,
there is interest in making collective use of distributed
computational and data resources. Recent work has converged to the
notion of the Grid, which attempts to uniformly present a
heterogeneous collection of distributed resources. Current Grid
research covers many areas from low level infrastructure issues to
high level application concerns. However, providing support for
efficient exploration and processing of very large scientific datasets
stored in distributed archival storage systems remains a challenging
research issue.

We  have  initiated  an  effort  that  focuses  on  developing  ef¬ 
ficient  data-intensive  applications  in  a  Grid  environment.  In 
this  paper,  we  present  a  framework,  called  filter-stream  pro¬ 
gramming,  that  represents  the  processing  units  of  a  data- 
intensive  application  as  a  set  of  filters,  which  are  designed  to 
be  efficient  in  their  use  of  memory  and  scratch  space.  We  de¬ 
scribe  a  prototype  infrastructure  that  supports  execution  of 
applications  using  the  proposed  framework.  We  present  the 
implementation  of  two  applications  using  the  filter-stream 
programming  framework,  and  discuss  experimental  results 
demonstrating  the  effects  of  heterogeneous  resources  on  ap¬ 
plication  performance. 


1.  Introduction 

Increasingly powerful computers have made it possible for
computational scientists and engineers to model physical phenomena in
greater detail. As a result, overwhelming amounts of experimental data
are being generated by scientific and engineering simulations. In
addition, large amounts of data are being gathered by sensors of
various sorts, attached to devices such as satellites and microscopes.
There are many examples of large useful datasets from simulations
[26, 29, 33], sensor data [25, 28], and medical imaging [2]
(pathology, MRI, CT scan, etc.). The primary goal of generating data
through large scale simulations or sensors is to better understand the
causes and effects of physical phenomena. Understanding is achieved
through running analysis codes on the stored data, or by a more
interactive visualization that relies on the ability to gain insight
from looking at a complex system. Thus, both data analysis and visual
exploration of large datasets play an increasingly important role in
many domains of scientific research. Decision support database
applications are similar to scientific applications because they deal
with large quantities of data (relational data), and need to perform
significant computation in processing the data. The value provided by
decision support systems and data-mining algorithms depends greatly on
the amount of data, and hence businesses are inclined to retain as
much data as possible.

Grants #ASC-9619020 (UC Subcontract #10152408), and by the Office of

Disks continue to become larger and cheaper, making them commodity
items. This makes it relatively easy to set up a large set of archival
storage disks at relatively low cost. For example, consider building a
large disk farm out of commodity PC components at the lowest current
prices: $400 for a motherboard with a Celeron or AMD K6-2 400MHz CPU
and 64MB of memory [9], four 40GB EIDE disks at $254 each [10], and a
fast Ethernet interconnect (100 Mbps). A farm of 8 such PCs can
present 1.25TB of disk space for less than $15K. The price point is
sufficiently low to enable many such disk collections to be set up
independently at multiple disparate locations, where local storage
needs dictate. We anticipate that this trend will result in the
emergence of islands of data, where cheap archival storage systems
will be used to hold large locally generated datasets. Use of
computation farms is also important for handling very large datasets
in a reasonable amount of time. Oftentimes, high performance
computation farms are where the data is generated (as in large
scientific simulations), and the data may reside locally on the
computation farm in an archival storage system such as HPSS [22].
Thanks to high-performance networks, increasing numbers of computation
farms have become accessible across a wide-area network. These
computation farms span a spectrum of widely varying configurations and
computation power, from relatively inexpensive networks of
workstations and PC clusters to very expensive high-performance
machines providing computing performance on the order of teraflops.
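
The disk-farm estimate in this paragraph can be checked with a short calculation. This is only a sketch: it assumes the quoted component prices, counts 1TB as 1024GB, and treats whatever remains under $15K as the budget for the interconnect and any other parts, which the text does not itemize.

```latex
\begin{align*}
\text{cost per PC}       &= \$400 + 4 \times \$254 = \$1416 \\
\text{cost of 8 PCs}     &= 8 \times \$1416 = \$11{,}328 \\
\text{capacity}          &= 8 \times 4 \times 40\,\text{GB} = 1280\,\text{GB} = 1.25\,\text{TB} \\
\text{remaining budget}  &= \$15{,}000 - \$11{,}328 = \$3672
\end{align*}
```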

These trends combine to present a new opportunity: very large
distributed datasets that can be used by applications for
computationally and data intensive analysis, exploration and
visualization.

Consider  the  following  scenario:  A  scientist  wants  to 
compare  properties  of  a  3D  reconstructed  view  of  a  raw 
dataset  recently  generated  at  a  collaborating  institution,  with 
the  properties  of  a  large  collection  of  reference  datasets.  The 
3D  reconstruction  operation  involves  retrieving  portions  of 
2D  slices  from  the  regions  in  question,  and  then  perform¬ 
ing  feature  recognition  and  interpolating  between  the  slices 
to  extract  the  important  3D  features.  A  description  of  these 
features  and  the  associated  properties  are  then  compared 
against  a  database  of  known  features,  and  some  appropriate 
similarity  measure  is  computed.  The  final  result  is  the  set  of 
reference  features  found  that  are  close  in  some  way  to  those 
found  in  the  new  raw  dataset,  along  with  the  corresponding 
view  renderings  to  visualize. 


Figure  1. 3D  reconstruction/visualization  sce¬ 
nario  on  distributed  collection  of  resources. 

Consider  the  problems  that  can  occur  when  the  applica¬ 
tion  is  executed  in  a  Grid  [16]  environment.  That  is,  the  re¬ 
quired  resources  (new  raw  dataset,  reference  database,  and 
the scientist) are all at distributed locations in a wide-area
network, as seen in Figure 1. The reference database is likely
to be stored in an image library, since the dataset is large and
useful to many users. The new raw dataset is stored at the
site where the sensor readings were taken. If the hosts containing
the data are low-power archival systems that make
the  execution  of  the  3D  reconstruction  code  prohibitively 
expensive,  it  becomes  unclear  how  to  structure  the  applica¬ 
tion  for  efficient  execution.  Ideally  we  would  like  to  execute 
portions  of  the  application  at  strategic  points  in  the  collec¬ 
tion  of  machines.  A  set  of  possible  locations  for  perform¬ 
ing  computation  is  indicated  in  the  figure  by  question  marks. 
For  example,  if  the  portion  of  code  that  performs  the  range 
select  on  the  new  raw  dataset  could  be  run  on  the  host  where 
the  data  lives,  the  amount  of  data  to  be  transmitted  over  the 
wide-area  network  (WAN)  would  be  reduced.  The  compu¬ 
tation  farm  is  an  ideal  location  for  the  feature  recognition 
and  3D  reconstruction  due  to  the  parallelism  inherent  in  the 
codes.  Given  the  set  of  features  that  were  identified,  it  would 
be  efficient  to  perform  the  selection  of  similar  features  from 
the  reference  database  on  the  data  server  where  the  database 
is  located.  The  low  end  PC  where  the  scientist  is  located  can 
be  used  to  collect  the  3D  rendering  and  the  similar  feature 
information  for  interactive  presentation  to  the  scientist. 

The  success  of  this  scenario  depends  on  the  application 
allowing  portions  of  its  computation  to  be  executed  in  a  dis¬ 
tributed  fashion.  Beyond  the  mere  possibility  of  execution 
in  a  distributed  environment  is  the  question  of  how  efficient 
the  application  is.  One  interpretation  of  efficiency  in  this 
context  is  the  ratio  of  useful  data  transmitted  to  the  total 
amount  of  data  transmitted  between  any  two  pieces  of  the 
application.  For  example,  if  an  application  transmitted  a  full 
dataset  from  a  remote  host,  and  discarded  a  large  portion  not 
required  by  subsequent  processing,  then  this  would  not  be 
considered  efficient  operation. 
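
One way to write down this interpretation of efficiency is as a ratio. The symbols below are ours, introduced only for illustration, and the numbers in the comments are hypothetical rather than measurements from this work.

```latex
\[
E = \frac{D_{\text{useful}}}{D_{\text{total}}}, \qquad 0 \le E \le 1
\]
% Hypothetical example: a 10 GB dataset of which later stages need 2 GB.
% Shipping the full dataset:      E = 2 / 10 = 0.2
% Selecting at the source first:  E = 2 / 2  = 1.0
```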

We  have  initiated  an  effort  to  investigate  and  develop 
methodologies  and  a  framework  for  efficient  execution  of 
applications  that  make  use  of  distributed  collections  of 
datasets  in  a  Grid  environment.  There  are  two  main  chal¬ 
lenges  in  developing  efficient  applications  in  a  Grid  environ¬ 
ment: 

•  The  Grid  is  composed  of  collections  of  heterogeneous 
resources.  The  characteristics,  capacity  and  power 
of  resources,  including  storage,  computation,  and  net¬ 
work,  vary  widely.  This  requires  that  applications 
should  be  structured  to  accommodate  the  heteroge¬ 
neous  nature  of  the  Grid. 

•  These  distributed  resources  can  be  shared  by  many  ap¬ 
plications.  This  requires  that  applications  should  be  de¬ 
signed  to  be  optimized  in  their  use  of  shared  resources. 

In  order  to  address  these  challenges,  we  are  investigating: 

•  Methodologies  and  a  framework  for  structuring  appli¬ 
cations.  In  particular,  we  address  decomposition  of  ap¬ 
plication  processing  into  components  and  placement  of 
these  components  onto  a  collection  of  heterogeneous 
resources  that  will  aid  efficient  execution. 
•  Feasibility and effects of exposing application structure
and  characteristics.  In  particular,  we  address  exposing 
resource  requirements  and  the  communication  pattern 
between  application  components,  and  how  this  extra 
application  structure  information  can  be  used. 

•  An  infrastructure  for  providing  execution  of  applica¬ 
tions  that  conform  to  the  developed  framework. 


A simple example of the filter-stream model described in this paper is
Unix system pipes, where the standard output of one process is used as
the standard input of another process. Unix pipes represent a linear
chain of filters, each of which has a single input stream and a single
output stream. The filter-stream model allows for arbitrary graphs of
filters with any number of input and output streams.


In this paper, we present a framework, called filter-stream
programming, that represents the processing in a data-intensive
application as a set of processing units, referred to here as filters,
which are designed to be efficient in their use of memory and scratch
space. In this framework, data exchange between any two filters is
described via streams, which are uni-directional pipes that deliver
data in fixed size buffers. We describe a prototype infrastructure
that provides support for execution of applications using the proposed
framework. We present the implementation of two applications in the
filter-stream programming framework, and experimental results to
demonstrate the effects of heterogeneous resources on the performance
of the applications.

2.  The  Proposed  Approach 


In this section, we present a framework, called the filter-stream
programming model. The basic ideas are to (1) constrain application
components to allow for location independence, which is necessary for
execution in a distributed environment, and (2) expose the application
communication pattern and resource requirements, allowing a runtime
system to aid in efficient execution. We should note that any
programming model (e.g., message passing) modified to expose similar
constraints could be employed in place of the filter-stream
programming model we describe.

The programming model used in this work is derived from the
stream-based programming model, originally developed for Active Disks
[1, 35]. Many stream-based algorithms were developed and analyzed for
Active Disks. These algorithms carry out a variety of data
transformations that arise in earth science applications and in
applications of the standard relational database sort, select, and
join operations. In this work we extend these algorithms and
investigate the application of filters and the stream-based
programming model in a Grid environment.

In  the  filter-stream  programming  model,  an  application 
is  represented  by  a  collection  of  filters.  A  filter  is  a  por¬ 
tion  of  the  full  application  that  performs  some  amount  of 
work.  Filters  are  required  to  pre-disclose  dynamic  memory 
and  scratch  space  needs.  Communication  with  other  filters 
is solely through the use of streams. A stream is a communication
abstraction that allows fixed-size untyped data
buffers to be transported from one filter to another. An example
set of filters for the motivating example is shown in
Figure 2.

Figure 2. 3D reconstruction application decomposed into filters.

The  process  of  manually  restructuring  an  application  us¬ 
ing  this  model  is  referred  to  as  decomposing  the  application. 
In  choosing  the  appropriate  decomposition,  we  need  to  con¬ 
sider  the  complete  data  flow  path  from  data  generation  to 
ultimate  consumption  and  the  target  machine  configuration, 
which  can  be  a  distributed  collection  of  heterogeneous  ma¬ 
chines.  The  main  goal  is  to  achieve  efficient  use  of  limited 
resources  in  a  distributed  and  heterogeneous  environment. 
The  choice  of  decomposition  can  have  a  significant  impact 
on  efficiency  and  performance.  Too  many  filters  could  mean 
there  is  not  much  work  for  individual  filters,  which  would 
cause  the  system  to  spend  much  of  its  time  moving  data 
around  and  little  time  performing  useful  work  on  the  data. 
Too  few  filters  could  limit  the  ability  of  the  overall  system 
to  execute  filters  concurrently.  Similarly,  sending  data  over 
streams  in  very  small  pieces  can  make  the  overhead  of  the 
runtime  system  too  large.  If  possible,  an  ideal  granularity 
size  should  balance  the  amount  of  computation  and  commu¬ 
nication  such  that  the  overall  processing  time  across  all  fil¬ 
ters  does  not  exhibit  a  penalty  merely  because  the  computa¬ 
tion  is  distributed. 

Given  a  set  of  filters,  the  runtime  mapping  of  filters  onto 
various  hosts  in  a  wide-area  grid  environment  is  referred  to 
as  placement.  Figure  3  shows  a  possible  placement  of  the 
filters  described  for  the  motivating  scenario.  The  choice  of 
placement  represents  the  main  degree  of  freedom  in  affect¬ 
ing  application  performance  by: 

•  placing  filters  with  affinity  to  data  sources  near  the 
sources, 

•  minimizing  communication  volume  on  slow  links, 

•  co-locating  filters  with  large  communication  volume, 

•  placing computationally intensive filters on less loaded
hosts,

•  pipelining application filters by concurrent execution.

Figure 3. Possible placement of 3D reconstruction application filters.
(Recoverable figure labels: Reference DB, feature list, extract raw,
Sensor, Raw Dataset, sensor readings, view result, Client PC.)

Note  that  a  placement  decision  is  not  assumed  to  be  static, 
and  the  programming  model  explicitly  supports  the  notion 
of  stopping  a  set  of  filters  and  replacing  them  with  possibly 
a  new  set  of  filters  with  a  different  placement. 
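
The placement criteria listed above can be made concrete with a small cost model. The following is a minimal sketch under our own assumptions, not the placement method of the prototype: the `Problem` and `Stream` types, the bottleneck cost, and the exhaustive search are all hypothetical, and the search is only practical for very small filter graphs.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical inputs; none of these names come from the paper.
struct Stream { std::size_t src, dst; double bytes; };  // filter indices, volume

struct Problem {
    std::vector<double> work;             // work[f]: operations needed by filter f
    std::vector<double> speed;            // speed[h]: operations/second of host h
    std::vector<std::vector<double>> bw;  // bw[a][b]: bytes/second from host a to b
    std::vector<Stream> streams;
};

// Estimated time of the slowest pipeline stage under a placement, where
// placement[f] is the host of filter f. Filters sharing a host add up their
// compute time; streams between co-located filters are treated as free.
double bottleneck(const Problem& p, const std::vector<std::size_t>& placement) {
    std::vector<double> hostTime(p.speed.size(), 0.0);
    for (std::size_t f = 0; f < p.work.size(); ++f)
        hostTime[placement[f]] += p.work[f] / p.speed[placement[f]];
    double worst = *std::max_element(hostTime.begin(), hostTime.end());
    for (const Stream& s : p.streams) {
        std::size_t a = placement[s.src], b = placement[s.dst];
        if (a != b) worst = std::max(worst, s.bytes / p.bw[a][b]);
    }
    return worst;
}

// Exhaustive search over all host assignments (assumes at least one host).
std::vector<std::size_t> bestPlacement(const Problem& p) {
    const std::size_t nF = p.work.size(), nH = p.speed.size();
    std::vector<std::size_t> cur(nF, 0), best(nF, 0);
    double bestCost = bottleneck(p, cur);
    while (true) {
        // Advance cur like an odometer with nF digits in base nH.
        std::size_t i = 0;
        while (i < nF && ++cur[i] == nH) { cur[i] = 0; ++i; }
        if (i == nF) break;  // every digit wrapped: all assignments visited
        double c = bottleneck(p, cur);
        if (c < bestCost) { bestCost = c; best = cur; }
    }
    return best;
}
```

With a large stream volume over a slow link the search co-locates the two communicating filters, and with a small volume it spreads them across hosts, which mirrors the trade-off between the co-location and load-balancing criteria.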

A  runtime  system  infrastructure  is  used  to  support  the  ex¬ 
ecution  of  applications  that  are  structured  in  the  filter-stream 
programming  model.  In  the  following  sections  we  present  a 
prototype  infrastructure  for  executing  application  filters,  and 
present  implementations  of  an  image  processing  application 
and  a  database  application  using  the  filter-stream  program¬ 
ming  model. 

3.  Related  Work 

There  is  a  large  body  of  hardware  and  software  research 
on  archival  storage  systems,  including  distributed  parallel 
storage  systems  [24],  file  systems  [34],  image  servers  [32], 
and  data  warehouses  [23].  Several  research  projects  have 
focused  on  digital  libraries  and  geographic  information  sys¬ 
tems  [4,  20]  that  access  collections  of  archival  storage  sys¬ 
tems,  high-performance  I/O  systems  [8],  tertiary  storage 
systems  [22]  and  remote  I/O  [19,  31].  Distributed  storage 
systems  attempt  to  provide  large  amounts  of  data  to  dis¬ 
tributed  clients.  They  present  a  uniform  view  of  distributed 
data  to  applications,  and  transparently  handle  replicas  and 
caching.  This  does  not  push  the  computation  to  the  data 
as  in  our  work,  rather  the  data  is  migrated  to  the  computa¬ 
tion,  but  can  achieve  a  similar  result  with  an  effective  re¬ 
placement  policy  and  a  warm  cache.  Another  issue  is  finding 
the  required  data.  The  Storage  Resource  Broker  (SRB)  [31] 
provides  uniform  UNIX-like  I/O  interfaces  and  meta-data 
management  services  to  locate  and  access  collections  of  dis¬ 
tributed  data  resources. 


Distributed computing research addresses the distributed execution
of application code in many different ways. Current work related
to Grid computing [7, 14, 16] attempts to provide a uniform view into
a  collection  of  distributed  computational,  network  and  stor¬ 
age  resources,  and  to  provide  services  for  unified,  secure,  ef¬ 
ficient  and  reliable  access.  However,  providing  support  for 
efficient  exploration  and  processing  of  very  large  scientific 
datasets  stored  in  archival  storage  systems  at  distributed  lo¬ 
cations  remains  a  challenging  research  issue,  and  the  neces¬ 
sity  of  infrastructure  to  provide  such  support  was  recognized 
in  recent  Grid  forums  [21].  This  support  of  processing  and 
retrieval  for  efficient  operation  is  exactly  what  our  work  is 
attempting  to  provide. 

There is a large body of classic work on dataflow systems. The macro
dataflow model [30, 36] describes an application as a set of tasks,
communication edges, edge communication costs and task computation
costs. PYRROS [37] uses this model of application behavior and manual
annotations to cluster, map, and schedule computation to nodes of a
homogeneous parallel machine. As we target a heterogeneous grid
environment, we must go beyond assumptions such as constant
computation cost regardless of placement, which make sense only in a
tightly coupled environment. There is also task parallel work in
systems such as STRAND [12], PCN, Fortran M [13], and HPF [18], which
are related due to the dataflow model and/or task parallelism used.
Our work is different in that we consider remote datasource affinity
as a primary reason for decomposition, rather than an attempt to
extract parallelism.

4.  A  Prototype  Infrastructure 

In this section, we describe a prototype infrastructure implementation
that provides support for execution of applications developed using
the filter-stream framework. This work is part of the DataCutter
project [6], which provides services for subsetting and processing
multi-dimensional datasets stored on archival storage systems.

4.1.  Filters 

A  filter  is  specified  by  the  code  to  execute,  and  a  descrip¬ 
tion  of  the  input  and  output  streams  it  will  use.  Currently, 
filter  code  is  expressed  using  a  C++  language  binding  by 
sub-classing  a  provided  filter  base  class.  This  base  class  pro¬ 
vides  a  well-defined  interface  between  the  filter  code  and 
the  system  filter  service.  The  description  of  input  and  out¬ 
put  streams  is  specified  in  a  separate  configuration  file  (Fig¬ 
ure  4). 

Filters are constrained in several respects. First, undisclosed
dynamic allocation of memory and local disk space is not allowed.
Instead, the filter must pre-disclose and be granted scratch memory
and disk space by the runtime system. The granted scratch space is
allocated on behalf of the filter by the runtime system when the
filter is instantiated. Later, the filter may make use of the granted
scratch space as needed. One of the potential benefits of exposing
resource requirements in this way is that the runtime system can
achieve a better placement of filters. For example, a filter can be
run on a machine with enough memory to avoid paging, and two filters
requesting large scratch space can be placed on two different
machines. In addition, the runtime system can potentially perform
better scheduling of co-located filters on a machine. One of our goals
in this project is to investigate and assess the potential benefits of
pre-allocating memory, when it will really be important, and its
implications for structuring applications. In order to accomplish
this, we plan to compare standard versions of target applications with
filter-stream based implementations in subsequent work.

The  interface  for  filters  consists  of  an  initialization  func¬ 
tion,  a  processing  function,  and  a  finalization  function: 


class MyFilter : public AS_Filter_Base {
public:
    int init(int argc, char *argv[]) { ... };
    int process(stream_t st) { ... };
    int finalize(void) { ... };
};


The  init  function  is  called  when  the  filter  is  instantiated, 
and  is  passed  parameters  with  the  command  line  arguments 
used  when  the  application  was  started.  This  is  where  a  fil¬ 
ter  would  request  scratch  memory  space  for  use  during  later 
processing,  for  example.  The  process  function  is  called  to 
handle  data  arriving  on  the  input  streams  in  buffers  from  the 
sending  filter.  The  parameter  passed  to  the  process  function 
contains  arrays  of  descriptors  for  the  sets  of  input  streams 
and  output  streams  this  filter  can  use.  The  filter  can  only  read 
and  write  from/to  the  provided  streams.  No  new  streams  can 
be  created  by  the  filter  at  runtime.  The  finalize  function  is 
called  after  all  processing  is  finished  and  the  filter  is  ready  to 
terminate.  This  is  where  a  filter  would  release  any  resources 
in  use. 
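
The call sequence described above (init, then process, then finalize) can be illustrated with a self-contained mock. This is a sketch under our own assumptions rather than the prototype's API: `FilterBase`, `buffer_t`, the layout of `stream_t`, `DropEmptyFilter`, and `runOnce` are hypothetical stand-ins, and `process` takes the stream descriptor by reference here so that the mock driver can observe what was written.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the runtime's types; the prototype's real
// declarations are not shown in the text.
struct buffer_t { std::vector<char> data; };
struct stream_t {
    std::vector<std::vector<buffer_t>> ins;   // pending buffers per input stream
    std::vector<std::vector<buffer_t>> outs;  // buffers written per output stream
};

// Mock of a filter base class with the three entry points named in the text.
class FilterBase {
public:
    virtual ~FilterBase() = default;
    virtual int init(int argc, char* argv[]) = 0;
    virtual int process(stream_t& st) = 0;
    virtual int finalize() = 0;
};

// Example filter: forwards non-empty buffers from input stream 0 to output
// stream 0 and counts the bytes it forwarded.
class DropEmptyFilter : public FilterBase {
public:
    int init(int, char*[]) override { bytes_ = 0; return 0; }
    int process(stream_t& st) override {
        for (const buffer_t& b : st.ins[0]) {
            if (b.data.empty()) continue;  // drop buffers with no useful data
            bytes_ += b.data.size();
            st.outs[0].push_back(b);
        }
        st.ins[0].clear();                 // all input buffers were consumed
        return 0;
    }
    int finalize() override { return 0; }
    std::size_t bytesForwarded() const { return bytes_; }
private:
    std::size_t bytes_ = 0;
};

// Minimal driver that mimics the order in which a runtime would call a filter.
std::size_t runOnce(DropEmptyFilter& f, stream_t& st) {
    f.init(0, nullptr);
    f.process(st);
    f.finalize();
    return f.bytesForwarded();
}
```

The filter only touches the streams it was handed, which reflects the restriction that filters cannot create streams at runtime.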

Another  restriction  is  that  a  filter  cannot  change  the 
source  of  its  input  streams  nor  the  sinks  of  its  output  streams. 
This  has  two  advantages.  First,  a  filter  does  not  need  to 
handle  buffering  and  scheduling  for  its  own  communication, 
thereby  reducing  the  complexity  of  filters.  Second,  the  loca¬ 
tion  of  filters  is  transparent,  allowing  filters  to  be  placed  at 
different  locations  initially  and  relocated  as  system  resource 
constraints  change. 

Filters  are  the  unit  of  placement.  Each  filter  can  poten¬ 
tially  be  executed  on  a  different  host.  In  addition,  a  filter’s 
location may change at discrete application-defined intervals
during the course of execution. Note this does not imply
true migration of code and state; rather, placement can
be recomputed and the filter can be stopped on the original
host and a new copy re-instantiated on the new host. There
is  a  limited  mechanism  for  a  final  state  transfer  by  a  sin¬ 
gle  buffer  transfer  from  the  old  instance  to  the  new  instance. 
This  approach  avoids  many  of  the  details  involved  in  check¬ 
pointing  and  process  migration  [11],  while  retaining  most  of 
the  benefits.  Filters  need  to  be  structured  appropriately  to 
handle  such  events.  For  cases  when  this  is  not  desirable,  a 
filter  can  be  pinned  to  a  particular  host,  which  means  the  fil¬ 
ter  will  always  be  placed  on  that  host.  This  host  affinity  is 
useful  for  some  situations,  such  as  when  runtime  libraries  or 
auxiliary  data  files  only  exist  on  a  particular  host,  but  does 
limit  placement  flexibility. 

4.2.  Streams 

A  stream  is  an  abstraction  used  for  filter  communica¬ 
tion.  Since  the  placement  of  filters  is  largely  unknown  un¬ 
til  runtime,  this  mechanism  is  used  to  achieve  location- 
independent  filter  code  because  stream  names  are  used 
rather  than  endpoint  location  on  a  specific  host.  A  stream 
is  used  to  specify  how  filters  are  logically  connected,  and 
to  provide  the  glue  at  runtime  to  attach  an  input  stream  for 
one  filter  to  an  output  stream  of  another.  All  transfers  to 
and  from  streams  are  through  a  provided  buffer  abstraction. 
A  buffer  represents  a  contiguous  memory  region  containing 
useful  data.  The  buffer  contains  a  pointer  to  the  start,  the 
length  of  the  portion  containing  useful  data,  and  the  maxi¬ 
mum  size  of  the  buffer.  In  the  current  prototype  implemen¬ 
tation  we  are  using  TCP  for  stream  communication,  but  any 
point-to-point  communication  library  could  be  added,  such 
as  Nexus  [17]. 
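As a minimal sketch of the buffer abstraction described above, the following C++ class tracks the three pieces of information named in the text: a pointer to the start, the length of the useful data, and the maximum size. The class name `StreamBuffer` and its member functions are hypothetical and are not taken from the prototype.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical buffer type: names are illustrative, not the prototype's API.
class StreamBuffer {
public:
    explicit StreamBuffer(std::size_t maxSize) : storage_(maxSize), length_(0) {}

    char*       start()         { return storage_.data(); }  // start of the region
    const char* start() const   { return storage_.data(); }
    std::size_t length() const  { return length_; }          // bytes of useful data
    std::size_t maxSize() const { return storage_.size(); }  // capacity of the region

    // Append bytes to the useful portion; returns false if they do not fit.
    bool append(const void* src, std::size_t n) {
        assert(src != nullptr || n == 0);
        if (n == 0) return true;
        if (n > maxSize() - length_) return false;
        std::memcpy(storage_.data() + length_, src, n);
        length_ += n;
        return true;
    }

    void clear() { length_ = 0; }  // reuse the region for the next transfer

private:
    std::vector<char> storage_;  // contiguous memory region
    std::size_t length_;         // length of the portion containing useful data
};
```

Keeping the length separate from the capacity lets a filter reuse one allocation for buffers of varying fill levels.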

The  streams  are  specified  in  a  global  sense,  separate  from 
the  application  code.  For  each  filter,  a  list  of  input  and  output 
streams  is  required.  This  discloses  all  potential  filter  com¬ 
munication  pairs  for  the  entire  execution  of  the  application. 
Given  a  set  of  filters  with  stream  connectivity  information, 
we  can  build  a  task  graph  where  the  nodes  represent  the  fil¬ 
ters,  and  the  edges  represent  stream  connections.  For  exam¬ 
ple,  given  three  filters  A,  B  and  C,  with  data  being  sent  from 
A  to  both  B  and  C,  and  from  B  to  C,  the  specification  and  re¬ 
sulting  task  graph  are  seen  in  Figure  4.  Each  filter  in  the  spec 
appears  in  a  section  labeled  [filter.<name>].  For  each  sec¬ 
tion,  two  optional  entries  ins  and  outs  can  appear  containing 
the  list  of  input  and  output  stream  names  respectively. 
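The construction of the task graph from the stream lists can be sketched as follows. This is an illustration under stated assumptions, not the prototype's code: the `FilterSpec` structure and the `buildTaskGraph` function are hypothetical names, and the specification is assumed to have already been parsed into memory.

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory form of one [filter.<name>] section.
struct FilterSpec {
    std::string name;
    std::vector<std::string> ins;   // input stream names
    std::vector<std::string> outs;  // output stream names
};

using Edge = std::pair<std::string, std::string>;  // (producer, consumer)

// Connect each output stream to every filter listing the same name as an input.
std::set<Edge> buildTaskGraph(const std::vector<FilterSpec>& filters) {
    std::map<std::string, std::string> producerOf;  // stream name -> producing filter
    for (const FilterSpec& f : filters)
        for (const std::string& s : f.outs)
            producerOf[s] = f.name;

    std::set<Edge> edges;
    for (const FilterSpec& f : filters)
        for (const std::string& s : f.ins) {
            auto it = producerOf.find(s);
            if (it != producerOf.end())  // streams without a producer are skipped
                edges.emplace(it->second, f.name);
        }
    return edges;
}
```

For the three-filter example in the text, this yields the edges A to B, A to C, and B to C.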

In  addition  to  the  above  inter-filter  streams,  we  allow 
for  two  other  types  of  streams:1  File  Streams  and  External 
Communication Streams. File Streams are used to read and
write  to  files  stored  in  local  scratch  disk  space  or  local  per¬ 
manent  disk  storage.  The  file  stream  abstraction  further  in¬ 
sulates  the  filter  code  from  specifics  about  the  host  system. 
This  provides  a  measure  of  safety  between  co-located  filters, 

1 These are not yet implemented in the prototype.

[filter.A]
outs = stream1 stream3

[filter.B]
ins = stream1
outs = stream2

[filter.C]
ins = stream2 stream3

(a) filter/stream spec    (b) resulting graph (edges: A to B, A to C, B to C)

Figure 4. Sample filter/stream specification.

since  one  filter  cannot  access  another  filter’s  scratch  disk 
space.  The  permanent  disk  storage  presents  a  uniform  file 
system  to  all  filters,  similar  to  a  traditional  file  system.  Thus 
a  filter  with  sufficient  authorization  can  read  files  in  perma¬ 
nent  disk  storage  written  by  another  filter.  External  Commu¬ 
nication  Streams  are  used  to  connect  to,  and  receive  connec¬ 
tions  from  legacy  or  other  non-filter  application  code. 

4.3.  Execution  Environment 

The  execution  service  performs  all  the  steps  necessary  to 
instantiate  filters  on  particular  machines,  connect  all  the  log¬ 
ical  stream  endpoints,  and  call  the  interface  functions  to  al¬ 
low  filters  to  run. 

The  description  of  where  to  instantiate  filters  is  provided 
by  a  placement  specification.  Currently,  this  is  statically 
generated  before  the  application  is  started.  An  example 
placement  specification  for  the  sample  filters  is: 


[placement] 

A  =  host1.cs.umd.edu 
B  =  host2.cs.umd.edu 
C = host3.cs.umd.edu


The  [placement]  section  is  expected  to  contain  one  en¬ 
try  for  each  filter.  The  value  is  simply  the  host  to  execute 
the  filter  on.  In  general,  this  host  can  be  a  parallel  ma¬ 
chine,  which  implies  multiple  instances  of  the  filter  are  cre¬ 
ated,  but  the  prototype  implementation  does  not  yet  support 
parallel  filters.  Security  concerns  have  made  it  difficult  to 
start  processes  on  remote  machines  in  a  uniform  manner. 
To solve this problem in the current prototype, an Application
Execution Daemon (appd) must be run on every host
used to execute filters. In the future, we plan to use existing
Globus [15] services for process creation and authentication,
in which case the Application Execution Daemons
would  not  be  needed.  In  addition,  a  single  provided  Direc¬ 
tory  Daemon  (dird),  which  is  similar  to  an  LDAP  server,  is 
used  to  record  the  contact  information  (host,  port,  pid)  for 
each  appd.  The  dird  is  the  only  process  that  runs  on  a  well- 
defined  host  and  port.  All  other  ports  are  ephemeral,  and 


registered  with  the  dird  to  later  be  queried.  Based  on  a  given 
placement  specification,  the  execution  of  a  filter-based  ap¬ 
plication  requires  contacting  the  appd  process  on  each  host. 
A  lookup  is  performed  to  find  contact  information  for  each 
required  appd.  Currently,  we  require  an  application  binary 
to  exist  on  every  host,  which  must  contain  at  least  the  code 
for  the  filters  that  will  execute  on  that  host.  The  binary  can 
contain  code  for  all  the  filters,  and  those  filters  not  intended 
to  run  on  a  given  host  will  not  be  instantiated  at  runtime. 
Currently  we  manually  compile/copy  the  binaries  as  needed, 
but  convenience  procedures  to  do  this  will  be  added  in  the 
future. 

The  application  is  started  by  running  the  application  bi¬ 
nary  on  some  host.  This  will  become  the  console  process, 
which  performs  no  application  processing  such  as  running 
filters.  The  console  process  queries  the  dird  process  to  get 
the  relevant  appd  contact  information,  and  then  sends  an  ex¬ 
ecute  command  to  each  appd.  The  appd  executes  the  appli¬ 
cation  binary  on  that  host,  which  in  turn  contacts  the  console 
process and performs some initial handshaking to set up the
stream  abstractions.  In  the  current  prototype,  one  POSIX 
thread  is  created  for  each  filter  that  runs  on  the  host,  and 
a  new  instance  of  the  application  filter  object  is  created. 
The  thread  calls  the  init  interface  function  passing  the  com¬ 
mand  line  arguments  that  were  used  when  the  console  pro¬ 
cess  was  started.  Next,  the  thread  calls  the  process  function. 
When  this  returns,  all  open  streams  are  closed  and  the  final¬ 
ize  function  is  called.  Any  remaining  filter  resources  are  re¬ 
leased  before  the  thread  stops. 
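The per-filter lifecycle described above can be sketched as follows. This is a simplified illustration, not the prototype's implementation: the `Filter` interface and the function names are hypothetical, stream handling is omitted, and `std::thread` stands in for the POSIX threads used by the prototype.

```cpp
#include <string>
#include <thread>
#include <vector>

// Hypothetical filter interface mirroring the init/process/finalize calls.
class Filter {
public:
    virtual ~Filter() = default;
    virtual void init(const std::vector<std::string>& args) = 0;
    virtual void process() = 0;
    virtual void finalize() = 0;
};

// Drive one filter instance through its lifecycle on the calling thread.
void runFilter(Filter& f, const std::vector<std::string>& args) {
    f.init(args);   // receives the console's command-line arguments
    f.process();    // main work; returns when the filter is done
    // (the prototype closes all open streams at this point)
    f.finalize();   // release remaining filter resources
}

// Start one thread per filter placed on this host and wait for all of them.
void runAllOnHost(const std::vector<Filter*>& filters,
                  const std::vector<std::string>& args) {
    std::vector<std::thread> threads;
    for (Filter* f : filters)
        threads.emplace_back([f, &args] { runFilter(*f, args); });
    for (std::thread& t : threads)
        t.join();
}

// Example filter that records the order in which its functions are called.
class TraceFilter : public Filter {
public:
    std::string trace;
    void init(const std::vector<std::string>&) override { trace += "init;"; }
    void process() override { trace += "process;"; }
    void finalize() override { trace += "finalize;"; }
};
```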

The  multiple  threads  allow  for  fairness  across  filters  on  a 
single  host,  since  all  threads  are  executed  with  the  same  pri¬ 
ority  by  the  underlying  operating  system.  No  one  filter  can 
starve  another  due  to  the  time  sharing  semantics  of  POSIX 
threads.  Of  course  the  filters  do  need  to  be  thread-safe  with 
respect  to  each  other.  Based  on  the  filter-stream  program¬ 
ming  model,  this  should  be  natural  for  most  applications. 
Filters  in  this  model  are  inherently  isolated  and  communi¬ 
cate  via  system  provided  buffers,  thus  should  be  fairly  easy 
to  make  thread-safe  due  to  the  lack  of  shared  resources.  One 
problem  could  be  common  library  routines.  For  the  cases 
where  no  thread-safe  implementations  exist,  we  provide  fil¬ 
ter  level  locks  that  can  be  used  to  wrap  the  offending  calls. 
This is only an issue when thread safety problems exist between
filters that run on the same host, and thus in the same process.
For the sample placement, filters A, B and C can all have thread
safety violations without consequence, since they actually run
in separate processes on three different hosts.
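Wrapping a non-thread-safe library call with a lock might look like the sketch below. The lock object and the helper function are hypothetical; the text does not show the interface of the prototype's filter-level locks. The standard `strtok` function is used only as a familiar example of a non-reentrant routine.

```cpp
#include <cstring>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical process-wide lock shared by all filters that call strtok().
std::mutex g_strtokLock;

// Split text on delimiters using the non-reentrant strtok(), guarded by the lock.
std::vector<std::string> splitLocked(std::string text, const char* delims) {
    std::vector<std::string> tokens;
    std::lock_guard<std::mutex> guard(g_strtokLock);  // held for the whole tokenization
    for (char* tok = std::strtok(&text[0], delims); tok != nullptr;
         tok = std::strtok(nullptr, delims))
        tokens.emplace_back(tok);
    return tokens;
}
```

The lock is held across the entire sequence of calls because `strtok` keeps hidden state between invocations.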

For  cases  when  thread  safety  is  a  problem  and  lock  wrap¬ 
ping  will  not  work,  the  infrastructure  could  be  augmented  to 
optionally  use  a  single  thread  for  all  filters  on  a  given  host. 
Control  could  use  a  dataflow  model  where  scheduling  is  per¬ 
formed  by  the  infrastructure  for  filters  based  on  the  arrival 
of input. Another alternative re-design is to make each filter
execute as a separate process, thus avoiding all threading issues
at the expense of increased filter communication costs
on  the  same  host.  The  use  of  a  thread-per-filter-instance  is  a 
property  of  the  current  prototype  implementation,  and  is  not 
mandated  by  the  overall  model. 

4.4.  Applicability 

Our  approach  is  intended  to  be  applicable  to  many  com¬ 
mon  types  of  data-intensive  applications  that  are  emerging 
for  use  in  a  grid  environment.  The  benefits  of  this  approach 
result  directly  from  two  observations.  The  first  is  that  the 
filter-stream  framework  exposes  useful  information,  partic¬ 
ularly  application  communication  pattern  and  communica¬ 
tion  volume  information.  The  second  is  that  expressing  the 
application  processing  as  filters  enables  data  volume  reduc¬ 
tion  from  remote  data  sources.  These  factors  can  be  lever¬ 
aged  to  improve  application  efficiency  at  runtime. 

We  recognize  that  the  approach  may  not  be  effective  for 
all  application  types,  and  are  identifying  characteristics  that 
make  applications  ill-suited  for  this  approach.  Ill-suited  in 
this  case  means  performance  will  be  no  better  than  that  of 
a  generic  message  passing  implementation,  for  example  us¬ 
ing  MPI  [27].  The  first  problem  occurs  when  applications 
have  high  selectivity.  This  means  nearly  all  the  remote  data 
is  needed  by  the  application,  and  no  significant  data  reduc¬ 
tion  is  achieved,  which  will  nullify  the  benefits  of  applica¬ 
tion  decomposition. 

Applications  that  lack  a  clear  task  structure  are  also  prob¬ 
lematic.  If  the  application  cannot  be  divided  cleanly  into  a 
set  of  filters,  then  placement  choices  are  more  limited  for 
such  a  monolithic  application.  For  example,  if  an  applica¬ 
tion  uses  two  remote  data  sources  and  cannot  be  divided  into 
filters,  we  can  execute  the  application  at  either  data  source 
(inputs),  the  client  (output),  or  at  an  intermediate  location. 
This  will  most  likely  be  efficient  only  for  data  located  at  the 
execution  site  chosen,  and  inefficient  for  other  input/output 
data  sources/sinks. 

The  communication  pattern  and  volume  are  significant 
characteristics  that  enable  intelligent  placement  to  overlap 
communication  with  computation  and  reduce  high  volume 
on  slow  network  links.  If  the  pattern  or  volume  of  com¬ 
munication  is  unknown,  chaotic,  very  fine  grained,  or  time 
varying,  then  it  is  difficult  to  perform  an  intelligent  place¬ 
ment.  For  example,  a  communication  pattern  that  involves 
all  possible  filters  and  is  data  dependent,  where  the  destina¬ 
tion  for  a  piece  of  data  is  known  only  after  its  examination, 
will  result  in  a  conservative  approximation  of  an  all-to-all 
pattern  with  equal  volume  between  all  pairs  of  filters.  There 
is  no  clear  choice  for  placement  in  this  case,  because  any 
possible  good  placement  may  only  be  known  after  execu¬ 
tion  has  finished  and  the  communication  activity  has  been 
observed.  Even  worse,  the  observed  communication  pattern 


and  volume  may  not  be  helpful  for  future  runs,  due  to  non¬ 
determinism  in  such  applications.  Our  approach  assumes  a 
single  significant  communication  pattern  and  deterministic 
volume,  which  can  be  used  for  choosing  placement  for  the 
entire  execution.  For  the  applications  we  are  targeting,  such 
as  volume  visualization,  database  decision  support,  and  im¬ 
age  processing,  these  assumptions  appear  to  hold. 

5.  Application:  Image  Processing 

The  Virtual  Microscope  [2]  is  a  query-response  appli¬ 
cation  that  processes  multi-dimensional  image  data  to  sat¬ 
isfy  client  queries.  The  dataset  contains  high  power  digi¬ 
tized  images  of  microscope  slides,  which  effectively  forms 
a  3D  dataset  because  each  slide  can  contain  multiple  2D  fo¬ 
cal  planes  at  different  depths.  Images  are  stored  at  the  high¬ 
est  magnification  level,  and  the  size  of  a  single  slide  typi¬ 
cally  varies  from  100MB  to  5GB,  compressed.  The  sys¬ 
tem  is  required  to  provide  interactive  response  times  simi¬ 
lar  to  a  physical  microscope,  including  continuously  mov¬ 
ing  the  stage  and  changing  magnification.  A  typical  query 
allows  a  client  to  request  a  2D  rectangular  region  at  a  partic¬ 
ular  magnification  from  within  the  bounds  of  a  single  focal 
plane.  The  processing  for  the  query  requires  projecting  high 
resolution  data  onto  a  grid  of  suitable  resolution  (governed 
by  the  desired  magnification)  and  appropriately  composit¬ 
ing  pixels  that  map  to  a  single  grid  point  to  avoid  introduc¬ 
ing  spurious  artifacts  into  the  displayed  image.  The  Virtual 
Microscope  is  useful  for  performing  operations  that  are  diffi¬ 
cult  with  a  physical  microscope,  such  as  simultaneous  view¬ 
ing  and  manipulation  of  a  single  slide  by  multiple  users,  or 
remote  telepathology  [2]  where  diagnosing  pathologists  are 
not  required  to  be  physically  located  near  the  slide. 

5.1.  Original  Implementation 

The  original  Virtual  Microscope  system  is  composed  of 
two components: a client to generate queries and display the
results (i.e., images), and a server to process the queries. The
server is composed of a frontend and a backend. The frontend
interacts with clients; it receives queries from clients
and  forwards  them  to  the  backend.  The  backend  consists  of 
one  or  more  processes,  typically  one  per  node  of  a  parallel 
machine.  The  processing  of  a  query  is  carried  out  entirely  in 
the  backend. 

In  order  to  achieve  high  I/O  bandwidth,  each  focal  plane 
in  a  slide  is  regularly  partitioned  into  data  chunks,  each  of 
which  is  a  rectangular  subregion  of  the  2D  image.  Data 
chunks  are  declustered  across  all  backend  local  disks  to 
achieve  I/O  parallelism.  Each  pixel  in  a  chunk  is  associated 
with a coordinate (in x- and y-dimensions) in the entire image.
Each chunk has an associated minimum bounding rectangle
(MBR) based on all the pixels in the chunk. An index
is created using the MBR of each chunk. Since the image is
regularly partitioned into rectangular regions, a simple computation
can be used instead of a complex index search.
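The "simple computation" for a regularly partitioned image can be sketched as integer division of the query bounds by the chunk dimensions. The types and the function name below are hypothetical, and uniform chunk sizes with half-open pixel rectangles are assumptions made for the illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Rect { long x0, y0, x1, y1; };  // half-open pixel rectangle [x0,x1) x [y0,y1)
struct ChunkId { long col, row; };     // position of a chunk in the chunk grid

// Return the grid coordinates of every chunk that intersects the query.
// chunkW/chunkH: chunk dimensions in pixels; cols/rows: chunks per dimension.
std::vector<ChunkId> chunksForQuery(const Rect& q, long chunkW, long chunkH,
                                    long cols, long rows) {
    assert(chunkW > 0 && chunkH > 0);
    std::vector<ChunkId> out;
    // Clamp the query to the image so overhanging queries stay valid.
    long x0 = std::max(0L, q.x0), y0 = std::max(0L, q.y0);
    long x1 = std::min(cols * chunkW, q.x1);
    long y1 = std::min(rows * chunkH, q.y1);
    if (x1 <= x0 || y1 <= y0) return out;  // empty or fully outside
    for (long r = y0 / chunkH; r <= (y1 - 1) / chunkH; ++r)
        for (long c = x0 / chunkW; c <= (x1 - 1) / chunkW; ++c)
            out.push_back({c, r});
    return out;
}
```

With an assumed chunk size of 900 x 900 pixels, a query aligned to a 5 x 5 block of chunks returns exactly 25 chunk coordinates.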

During  query  processing,  the  backend  process  finds  the 
chunks  that  intersect  the  query  region,  and  reads  them  from 
the  local  disks.  Each  data  chunk  is  stored  in  compressed 
form  (JPEG  format),  and  must  be  first  decompressed.  Then, 
it  is  clipped  to  the  query  region.  Afterwards,  each  clipped 
chunk  is  subsampled  to  achieve  the  zoom  level  (magnifica¬ 
tion)  specified  in  the  query.  The  resulting  image  blocks  are 
directly  sent  to  the  client.  The  client  viewer  assembles  and 
displays  the  image  blocks  from  each  of  the  backend  pro¬ 
cesses  to  form  the  query  output. 

5.2.  Filter  Implementation 

The  filter  decomposition  used  for  the  Virtual  Microscope 
system  [6]  is  shown  in  Figure  5.  This  filter  pipeline  struc¬ 
ture  is  natural  for  query-response  applications.  The  figure 
only  depicts  the  main  dataflow  path  of  image  data  through 
the system; other low-volume streams related to the client-server
protocol are not shown for clarity. The thickness of
the stream arrows indicates the relative volume of data that
flows on the different streams.


Figure  5.  Virtual  Microscope  decomposition 

In  this  implementation  each  of  the  main  processing  steps 
in  the  server  is  a  filter: 

•  read-data:  Full-resolution  data  chunks  that  intersect 
the  query  region  are  read  from  disk,  and  written  to  the 
output  stream. 

•  decompress:  Image  blocks  are  read  individually  from 
the  input  stream.  The  block  is  decompressed  using 
JPEG  decompression  and  converted  into  a  3  byte  RGB 
format.  The  image  block  is  then  written  to  the  output 
stream. 

•  clip:  Uncompressed  image  blocks  are  read  from  the  in¬ 
put  stream.  Portions  of  the  block  that  lie  outside  the 
query  region  are  removed,  and  the  clipped  image  block 
is  written  to  the  output  stream. 

•  zoom:  Image  blocks  are  read  from  the  input  stream, 
subsampled  to  achieve  the  magnification  requested  in 
the  query,  and  then  written  to  the  output  stream. 

•  view:  Image  blocks  are  received  for  a  given  query,  col¬ 
lected  into  a  single  reply,  and  sent  to  the  client  using  the 
standard  Virtual  Microscope  client/server  protocol. 

Figure  6  illustrates  the  high-level  code  for  the  zoom  fil¬ 
ter,  which  has  two  input  streams  and  one  output  stream.  It 


VM_zoom::init() {
    // Allocate output buffer from pre-allocated scratch space
    bufOut = AllocFromScratch(getOutputStreamBufferSize());
}

VM_zoom::process(stream_t &st) {
    DC_StreamBuffer *buf;
    VMQuery *query;
    VMChunk *chunk;

    // recv the query
    buf = st.ins[0].read();
    query = VMUnpackQuery(buf);

    // while there is data retrieved from input stream
    while ((buf = st.ins[1].read()) != NULL) {
        chunk = VMUnpackChunk(buf);    // extract chunk information
        zoom_chunk(chunk, query);      // perform zoom operation
        bufOut = VMPackChunk(chunk);   // pack chunk into buffer
        st.outs[0].write(&bufOut);     // write data to output stream
        FreeToScratch(chunk->Data);
    }
}

VM_zoom::finalize() {
    FreeToScratch(bufOut);
}

void VM_zoom::zoom_chunk(VMChunk *chunk, VMQuery *query) {
    int i, j;
    int rel_zoom = query->Zoom/chunk->Zoom;
    int width = chunk->Width/rel_zoom;
    int height = chunk->Height/rel_zoom;
    int size = width*height*PIXELSIZE;
    char *pSrc = chunk->Data;
    char *pDst = chunk->Data = AllocFromScratch(size);

    // subsample the image block
    for (j = height; j>0; --j) {
        for (i = width; i>0; --i) {
            memcpy(pDst, pSrc, PIXELSIZE);
            pSrc += rel_zoom*PIXELSIZE;
            pDst += PIXELSIZE;
        }
        // the inner loop consumed one source row; skip the next rel_zoom-1 rows
        // (assumes chunk->Width is a multiple of rel_zoom)
        pSrc += (rel_zoom-1)*chunk->Width*PIXELSIZE;
    }

    // update chunk metadata
    chunk->Zoom = query->Zoom;
}

Figure  6.  The  high-level  code  for  zoom  filter. 


reads the query from stream 0 (st.ins[0]) and data chunks
from stream 1 (st.ins[1]), and subsamples the received data
chunks using the zoom_chunk function. The zoom filter
uses scratch space to store results during subsampling and
to pack the subsampled chunk into the output buffer. The
result is written to the output stream (st.outs[0]), which connects
the filters zoom and view.
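The subsampling step can also be illustrated in isolation. The following self-contained sketch keeps every `factor`-th pixel in each dimension of a row-major image with 3-byte RGB pixels. It is not the Virtual Microscope code: the function name, the use of `std::vector`, and the per-row indexing are choices made for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

const std::size_t kPixelSize = 3;  // 3-byte RGB, as produced by the decompress filter

// Keep every `factor`-th pixel in each dimension of a row-major image.
// Returns the subsampled pixels; outW/outH receive the new dimensions.
std::vector<unsigned char> subsample(const std::vector<unsigned char>& src,
                                     std::size_t width, std::size_t height,
                                     std::size_t factor,
                                     std::size_t& outW, std::size_t& outH) {
    assert(factor > 0 && src.size() == width * height * kPixelSize);
    outW = width / factor;
    outH = height / factor;
    std::vector<unsigned char> dst(outW * outH * kPixelSize);
    for (std::size_t j = 0; j < outH; ++j) {
        // Source row j*factor; indexing from the row start avoids stride
        // drift when width is not a multiple of factor.
        const unsigned char* pSrc = src.data() + (j * factor) * width * kPixelSize;
        unsigned char* pDst = dst.data() + j * outW * kPixelSize;
        for (std::size_t i = 0; i < outW; ++i) {
            std::memcpy(pDst, pSrc, kPixelSize);
            pSrc += factor * kPixelSize;
            pDst += kPixelSize;
        }
    }
    return dst;
}
```

With a factor of 8, the output holds 1/64 of the input pixels, which is the source of the data volume reduction when the zoom filter runs close to the data.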

5.3.  Experimental  Results 

Using  the  filters  described  in  Section  5.2,  we  have  im¬ 
plemented  a  simple  data  server  for  digitized  microscopy  im¬ 
ages  [6],  stored  in  the  IBM  HPSS  archival  storage  system  at 
the  University  of  Maryland.  An  existing  Virtual  Microscope 
client  trace  driver  was  used  to  drive  the  experiments.  This 
driver  was  always  executed  on  the  same  host  as  the  view  fil¬ 
ter,  which  is  referred  to  as  the  client  host.  The  server  host  is 
where the read-data filter is run, which is the machine containing
the disks with the dataset.

The  HPSS  setup  has  10TB  of  tape  storage  space,  500GB 
of disk cache, and is accessed through a 10-node IBM SP.

Figure 7. 2D dataset and query regions. The table lists transmitted data sizes for q5. zoom (no) is for no subsampling and zoom (8) is for a subsampling factor of 8 (in each of the two spatial dimensions).

In all experiments we use a 4GB 2D compressed JPEG image
dataset  (90GB  uncompressed),  created  by  stitching  together 
smaller digitized microscopy images. This dataset is equivalent
to a digitized slide with a single focal plane that has
180K x 180K RGB pixels. The 2D image is regularly partitioned
into 200 x 200 data chunks and stored in HPSS in a
set  of  files.  We  defined  five  possible  queries,  each  of  which 
covers  5x5  chunks  of  the  image  (see  Figure  7).  The  ex¬ 
ecution  times  we  will  show  are  response  times  seen  by  the 
visualization  client  averaged  over  5  repeated  runs.  For  the 
presented experiments, we eliminated the effects of retrieving
data stored on tape by ensuring the data was staged to the
HPSS  disk  cache  before  each  run.  We  are  using  machines 
on  our  local  area  network  for  experimental  repeatability,  and 
will  switch  to  hosts  in  a  wide-area  Grid  environment  once 
application  behavior  is  sufficiently  well-understood. 

Overhead  of  Using  Filters.  The  query  execution  times  for 
the  original  optimized  Virtual  Microscope  server  versus  the 
prototype  filter  implementation  are  shown  in  Figure  8.  In 
this  experiment  the  entire  dataset  is  loaded  from  HPSS  and 
stored  on  a  single  local  disk  attached  to  a  SUN  Ultra  1  work¬ 
station,  because  the  original  server  can  only  access  datasets 
stored on disks. The loading of the dataset took 4750 seconds
(1 hour 19 minutes).

Figure 8. Query execution times for the original server and the filter implementation (subsampling factor is 8).

The original server is run as a
single process, and all filters in the filter-stream implementation
are executed on the same uniprocessor SUN workstation
where the dataset has been pre-loaded. In both cases
the  client  is  run  on  another  SUN  Ultra  1  workstation  con¬ 
nected  to  the  local  Ethernet  segment.  As  is  seen  from  the  fig¬ 
ure,  the  filter  implementation  does  not  introduce  much  over¬ 
head  compared  to  the  optimized  original  server.  The  percent 
increase  in  query  execution  time  ranges  from  6%  to  30% 
across  all  queries.  The  filter  version  contains  extra  work  not 
present  in  the  original  server,  such  as  flattening  of  the  chunk 
and  metadata  into  a  linear  buffer  on  the  sending  filter,  and 
expanding  the  chunk  and  metadata  into  the  same  structure  in 
the  receiving  filter.  This  overhead  is  necessary  when  filters 
are  located  on  distributed  machines,  but  could  be  eliminated 
for  the  co-located  case  by  instead  sending  a  pointer  to  an  in¬ 
memory  structure,  which  would  eliminate  much  of  the  over¬ 
head.  This  experiment  is  designed  to  be  biased  against  the 
filter  implementation  to  see  what  the  overhead  is  in  the  de¬ 
composed  version.  We  should  also  note  that  the  timings  do 
not  include  the  time  for  loading  the  dataset  from  tape,  which 
can  substantially  increase  for  larger  datasets  and  datasets 
stored  in  archival  storage  systems  across  a  wide-area  net¬ 
work. 

Varying  the  Processing.  One  node  of  the  IBM  SP  is  used 
to  access  the  stored  dataset,  and  the  client  was  run  on  a  SUN 
workstation  connected  to  the  SP  node  through  the  depart¬ 
ment  Ethernet.  We  experimented  with  different  placements 
of  the  filters  by  running  some  of  the  filters  on  the  same  SP 
node  where  the  data  is  accessed,  as  well  as  on  the  SUN  work¬ 
station  where  the  client  is  run. 

Figure 9. Execution time of queries under varying processing (subsampling), plotted against filter placement and query: (a) no zooming (no subsampling); (b) subsampling by a factor of 8. R, D, C, Z, V denote the filters read-data, decompress, clip, zoom, and view, respectively. <server>-<client> denotes the placement of the filters in each set.

In Figure 9 we consider varying the placement of
the filters under different processing requirements. Figures
9(a) and (b) show the query execution times when the
image  is  viewed  at  the  highest  magnification  (no  subsam¬ 
pling)  and  when  the  subsampling  factor  is  8  (i.e.  every  8th 
pixel  in  each  dimension  is  output),  respectively.  There  are 
three  predominant  factors  in  these  experiments.  The  first 
is the placement of the most computationally intensive filter
(decompress). The second is the volume of data transmitted
between the two machines. The final factor is the
amount  of  data  sent  by  the  view  filter  to  the  client  driver. 
Consider  the  first  two  groups  of  bars  in  the  figures.  The  dif¬ 
ference  between  the  groups  within  each  figure  is  the  place¬ 
ment  of  the  zoom  filter  on  the  server  (RDCZ-V)  or  client  host 
(RDC-ZV).  When  there  is  no  subsampling,  query  execution 
times  remain  almost  the  same  for  both  placements,  because 
the  volume  of  data  transfer  between  the  server  and  client 
is  the  same  in  both  cases.  In  the  case  of  subsampling,  the 
placement  of  the  zoom  filter  makes  a  difference,  because  the 
volume  of  data  sent  from  the  server  to  the  client  decreases 
if  the  zoom  filter  is  executed  at  the  server.  Now  consider 
the  last  two  groups  of  bars  in  the  figures.  The  difference 
between  the  groups  within  each  figure  is  the  placement  of 
the decompress filter (RD-CZV or R-DCZV). For the
no-subsampling case, the time increases substantially when
decompress is placed on the client, because of the combined
effects of the most computationally intensive filter
(decompress) and the high amount of data being processed by view
and  sent  to  the  client  driver.  When  there  is  subsampling,  the 
query  execution  time  is  not  as  high,  because  the  amount  of 
data  processed  by  view  and  sent  to  the  client  driver  is  much 
lower.  These  experiments  demonstrate  the  complex  inter¬ 
actions  between  placement  of  computation  and  communica¬ 


tion  volume. 

Varying  the  Server  Load.  In  the  next  set  of  experiments 
(Figure  10),  we  consider  varying  the  server  load.  We  use 
the  same  experimental  setup  as  for  the  previous  experiment. 
In  all  experiments,  we  use  a  subsampling  factor  of  8.  Fig¬ 
ures  10(a),  (b),  and  (c)  show  query  execution  times  when  the 
server  load  is  doubled,  tripled,  and  quintupled,  respectively. 
The different loads were emulated by artificially slowing
down the set of filters running on the server host so that their
total running time was extended. For example, the zoom filter
runs twice as long in the 2x case.
As  server  load  increases  (or  the  client  host  becomes  rela¬ 
tively  faster),  running  the  filters  on  the  client  host  achieves 
better  performance.  This  result  is  not  unexpected,  but  the 
experiment  quantifies  the  effect  for  this  particular  configura¬ 
tion.  The  use  of  a  different  client  to  server  network,  or  hosts 
with  different  relative  speeds  would  significantly  change  the 
observed  trends  and  trade-off  points. 

6.  Application:  External  Sort 

External sort has a long history of research in the database
community, which has produced many fast algorithms [3, 5].
The  application  starts  with  a  large  unsorted  data  file  that  is 
partitioned  across  multiple  nodes,  and  the  output  is  a  new 
partitioned  data  file  that  contains  the  same  data  sorted  on  a 
key  field.  The  sample  data  file  is  based  on  a  standard  sort¬ 
ing  benchmark  that  specifies  100  byte  tuples,  with  the  first 
10  bytes  being  the  sort  key.  The  distribution  of  the  key  val¬ 
ues  is  assumed  to  be  uniform,  both  in  terms  of  the  unsorted 
file as a whole and for each partition. A recent record holder
for the fastest external sort is NowSort [5], and we use the
pipelined version of their two-pass parallel sort for our basic
algorithm.

Figure 10. Execution time of queries under varying server load, plotted against filter placement and query: (a) 2x server load; (b) 3x server load; (c) 5x server load. 2x means the server computation is delayed to double the execution time of a filter on the server, etc. (subsampling factor is 8).

The  algorithm  proceeds  in  two  phases.  The  first  phase 
generates  temporary  sorted  runs  on  each  node,  and  the  sec¬ 
ond  phase  produces  the  output  sorted  partition  on  each  node. 
During the first phase, a reader reads chunks of tuples from
the unsorted input file on disk, partitions the records according
to which node they will reside on when sorted, puts
them into in-memory buffers, and when a buffer is full,
sends it to the correct node. A writer collects tuples from
all  nodes,  and  when  the  in-memory  buffer  is  full,  sorts  it  us¬ 
ing  partial-radix  sort2,  and  writes  the  sorted  run  to  disk.  This 
first  phase  is  over  when  all  the  unsorted  input  files  have  been 
processed,  and  written  to  disk  as  temporary  sorted  runs.  For 
the  second  phase  a  merge-reader  reads  tuples  from  each  lo¬ 
cal  sorted  run  into  merge  input  buffers.  A  merger  selects  the 
lowest-value  key  among  all  merge  input  buffers  and  copies 
it  to  an  output  buffer,  from  which  the  merge-writer  appends 
buffers  to  the  sorted  output  file  on  disk.  This  phase  com¬ 
pletes  when  tuples  from  all  local  runs  have  been  merged. 
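The merge step of the second phase can be sketched as a k-way merge. The text only says that the merger selects the lowest-value key among all merge input buffers; the min-heap used below is one common way to do that and is an assumption of this sketch, as are the function name and the use of in-memory string keys in place of on-disk tuples.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Merge already-sorted runs into one sorted sequence using a min-heap of
// (key, run index) pairs. Keys are plain strings here for brevity.
std::vector<std::string> mergeRuns(const std::vector<std::vector<std::string>>& runs) {
    using Item = std::pair<std::string, std::size_t>;
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
    std::vector<std::size_t> next(runs.size(), 0);  // read position in each run

    for (std::size_t r = 0; r < runs.size(); ++r)
        if (!runs[r].empty()) {
            heap.emplace(runs[r][0], r);
            next[r] = 1;
        }

    std::vector<std::string> out;
    while (!heap.empty()) {
        Item top = heap.top();
        heap.pop();
        out.push_back(top.first);       // lowest remaining key
        std::size_t r = top.second;
        if (next[r] < runs[r].size()) { // refill from the same run
            heap.emplace(runs[r][next[r]], r);
            ++next[r];
        }
    }
    return out;
}
```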

6.1.  Filter  Implementation 

The  implementation  of  external  sort  using  filters  follows 
the  above  description.  The  location  of  the  unsorted  dataset 
dictates  the  number  of  nodes  to  be  used  for  execution.  There 
are  two  filters  on  each  node,  Partitioner  and  Sorter.  The 
Partitioner  filter  reads  buffers  from  the  unsorted  input  file, 
and  distributes  the  tuples  into  buckets  based  on  the  key 
value.  When  a  bucket  is  full,  it  is  sent  over  the  stream  that 
connects  to  the  Sorter  filter  on  the  corresponding  node.  The 

2 Making two passes over the keys with a radix size of 11 bits [3] plus a
cleanup.


Sorter  continually  reads  buffers  from  the  input  streams,  and 
extracts  a  portion  of  the  key  and  appends  it  to  a  sort  buffer. 
When  the  sort  buffer  becomes  full,  it  is  sorted  and  written 
to  scratch  space  as  a  single  temporary  run.  When  all  buffers 
have  been  read  from  the  input  streams,  the  merge  phase  be¬ 
gins  with  only  the  Sorter  filters  still  executing.  The  Sorter 
filter  then  reads  sorted  tuples  from  each  of  the  temporary  run 
files  and  merges  them  into  a  single  output  buffer,  and  writes 
this  buffer  to  the  sorted  output  file  on  disk. 
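
The Partitioner's bucketing step can be illustrated as follows. This is a hedged sketch under stated assumptions: the paper says only that tuples are distributed into buckets "based on the key value", so the equal-width key ranges, the function names, and the representation of a stream send as a `(node, bucket)` pair are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def destination_node(key: int, key_space: int, num_nodes: int) -> int:
    """Map a key in [0, key_space) to the node owning its range (assumed equal-width ranges)."""
    return min(key * num_nodes // key_space, num_nodes - 1)


def partition(keys: Iterable[int], key_space: int, num_nodes: int,
              bucket_capacity: int) -> List[Tuple[int, List[int]]]:
    """Return the sequence of (node, bucket) sends a Partitioner would make."""
    buckets: Dict[int, List[int]] = defaultdict(list)
    sends: List[Tuple[int, List[int]]] = []
    for key in keys:
        node = destination_node(key, key_space, num_nodes)
        buckets[node].append(key)
        if len(buckets[node]) == bucket_capacity:   # full bucket: send it to that node's Sorter
            sends.append((node, buckets.pop(node)))
    for node in sorted(buckets):                    # end of input: flush partially filled buckets
        sends.append((node, buckets[node]))
    return sends


print(partition([0, 10, 250, 255, 128, 5], key_space=256, num_nodes=4, bucket_capacity=2))
```

With a uniform key distribution, equal-width ranges send roughly the same number of tuples to every Sorter; skewing the range boundaries is one way to shift merge-phase work between nodes.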

This  application  is  essentially  a  parallel  SPMD  program, 
with  an  all-to-all  communication  pattern.  This  organization 
is  in  contrast  to  the  Virtual  Microscope  application  that  was 
structured  as  a  processing  chain  pipeline. 

6.2.  Experimental  Results 

The  experimental  setup  is  a  16  node  cluster  of  dual 
400MHz  Pentium  IIs  with  256MB  memory  per  node,  run¬ 
ning  Linux  kernel  2.2.12.  There  are  two  interconnects,  a 
shared  Ethernet  segment,  and  a  switched  gigabit  Ethernet 
channel.  We  use  the  faster  switched  interconnect  for  all  ex¬ 
periments,  and  because  of  a  problem  with  the  network  inter¬ 
face  cards  on  some  of  the  nodes,  only  use  a  maximum  of  8 
nodes  in  all  experiments.  The  nodes  are  isolated  from  the 
rest  of  the  network,  and  the  cluster  was  not  running  other 
jobs  during  the  experiments.  Each  node  has  a  single  Ultra2 
SCSI  disk.  All  data  for  a  particular  node,  including  tempo¬ 
rary  data,  is  stored  on  the  single  local  disk.  The  dataset  con¬ 
sists  of  a  single  128MB  unsorted  file  per  node.  The  unsorted 
dataset  was  generated  randomly  with  a  uniform  key  distribu¬ 
tion.  The  execution  time  for  an  experiment  is  the  maximum 
time  across  all  nodes  used  for  the  experiment.  Each  exper¬ 
iment  is  repeated  for  5  trials,  and  the  execution  time  shown 
represents  the  average  of  the  trials.  Both  a  Partitioner  and 


Filter            Parameter    Full Memory  Description
----------------  -----------  -----------  -----------
Partitioner       read_size    256 KB       single disk buffer for reading tuples from the unsorted input file
Partitioner       bucket_size  1 MB         shared space for all outgoing tuple buckets, before sending to Sorter filters
Sorter (phase 1)  keybuf_size  1 MB         single buffer for storing extracted key and tuple pointer, before sorting and writing the temporary run
Sorter (phase 2)  sharedbuf    768 KB       shared disk buffer for reading from all temporary runs during merge
Sorter (phase 2)  outputbuf    512 KB       single disk buffer for writing sorted tuples to output file

Figure  12.  Memory  parameters  used  by  the 
sort  filters.  The  Full  Memory  column  contains 
the  initial  value  for  each  parameter. 


Figure  11.  Sort  execution  time  as  number  of 
nodes  is  increased.  The  dataset  size  is  scaled 
with  the  number  of  nodes  (128MB/node). 


a Sorter filter are executed on each node used in the experiments
for Figure 11 and Figures 13(a)-(d), and two Partitioner
and two Sorter filters are executed on some of the nodes
in the experiments for Figures 13(e) and (f). The disk cache
was cleared between executions to ensure a cold disk cache
for  each  run.  Note  that  we  are  using  a  tightly  coupled  cluster 
for  experimental  repeatability,  and  will  be  switching  to  hosts 
on  a  wide-area  Grid  environment  when  application  behavior 
is  better  understood. 

Scaling.  The  first  experiment  examines  the  scalability  of 
the  sort  application  as  we  increase  the  number  of  nodes  and 
total dataset size. As seen in Figure 11, the application
scales well: the fast interconnect is not saturated by the
traffic generated by sort. This experiment demonstrates that
there is nothing inherent
in  the  filter-stream  based  implementation  that  would  other¬ 
wise  limit  its  scalability. 

Varying  Memory  Size.  In  this  set  of  experiments  we  vary 
the  amount  of  memory  available  for  filters  on  some  of  the 
nodes  while  keeping  it  constant  for  filters  on  the  remaining 
nodes.  Our  goal  is  to  create  a  heterogeneous  configuration 
in  a  controlled  way,  and  observe  the  effects  of  heterogeneity 
on  the  application  performance. 

Figure  13  shows  the  execution  times  under  varying  mem¬ 
ory  constraints.  The  solid  line  in  all  of  the  graphs  denotes 
the base case, in which the size of the memory is reduced
equally across all nodes, and shows the change in the execution
time. The amount of memory for the Full case was determined
empirically to minimize execution time while consuming
the least memory (see Figure 12). Memory parameters
are varied by halving the full memory amount for the
1/2 case, halving again for the 1/4 case, and so on. Constraining
memory causes the filters to read, process, and write data in
smaller pieces, so performance should suffer. As is shown
by  the  solid  line  in  the  figure,  the  execution  time  increases  as 
the  size  of  the  memory  is  decreased.  In  the  experiments  with 
heterogeneous  memory  configuration,  we  divide  the  eight 
nodes  into  two  sets  of  four  nodes.  The  first  set  of  nodes  re¬ 
tains  the  initial  amount  of  memory  (i.e.,  Full  memory)  for 
all  runs,  while  the  second  set  has  their  memory  reduced  for 
each case. The left bar for each case in each graph shows
the  maximum  of  the  execution  times  on  the  nodes  with  full 
memory.  Similarly,  the  right  bar  for  each  case  in  each  graph 
shows  the  maximum  of  the  execution  times  on  the  nodes 
with  reduced  memory.  As  is  shown  in  Figure  13(a),  we  ob¬ 
serve  performance  degradation  similar  to  the  base  case.  The 
nodes  that  use  a  constant  amount  of  memory  finish  sooner, 
but  the  entire  job  runs  no  faster.  In  this  experiment,  both  the 
input  data  to  the  Partitioner  filter  and  the  output  of  the  Par¬ 
titioner  (i.e.  the  input  data  to  the  Sorter  filter)  on  each  node 
are  regularly  partitioned  across  all  the  nodes. 

Notice  that  the  total  amount  of  memory  across  all  nodes 
for  this  experiment  is  larger  than  that  for  the  base  case  be¬ 
cause  half  the  nodes  keep  full  memory.  For  example,  for  the 
1/8  memory  case,  350%  more  memory  was  being  used  in 
aggregate  than  for  the  1/8  base  case.  Instead  of  a  reduction 
in  sort  time,  the  extra  memory  results  in  a  load  imbalance 
between  the  two  sets  of  four  nodes.  Hence,  in  the  next  ex¬ 
periment  we  partitioned  the  amount  of  input  data  for  each 
node  irregularly,  to  attempt  to  reduce  overall  execution  time. 
Figure  13(b)  shows  that  the  execution  time  increases  when 
we  partition  the  input  data  based  on  available  node  mem¬ 
ory,  i.e.,  full  nodes  have  more  input  data  than  nodes  with  re¬ 
duced  memory.  This  results  from  an  increase  in  the  time  for 
the  partitioning  phase,  because  the  Partitioner  filters  on  the 
set  of  nodes  with  full  memory  have  more  input  records  to 
process. The execution time for the merge phase is effectively
unchanged, because the amount of data sent to each
node  is  unchanged.  Figure  13(c)  shows  the  result  of  parti¬ 
tioning  the  output  of  the  Partitioner  filter  (and  thus  the  merge 
phase  work)  according  to  the  memory  usage  of  the  receiv¬ 
ing  node.  This  experiment,  however,  moves  too  much  work 
to  the  nodes  with  full  memory,  so  that  those  nodes  become 
the  longest  running  node  set.  To  improve  performance  fur¬ 
ther,  we  followed  two  different  approaches.  In  Figure  13(d), 
the  Partitioner  filter  output  is  adjusted  to  balance  the  perfor¬ 
mance  of  both  sets  of  nodes  (approximately  a  10%  reduc¬ 
tion  in  the  number  of  tuples  assigned  to  a  node  for  each  1/2 
reduction  in  memory  usage).  In  this  case,  we  observe  bet¬ 
ter  performance  than  the  previous  cases.  In  the  second  ap¬ 
proach,  we  partitioned  both  the  input  data  and  the  output  of 
the  Partitioner  filter  as  was  done  in  the  experiment  for  Fig¬ 
ure  13(c),  but  executed  two  Sorter  and  two  Partitioner  fil¬ 
ters  on  the  nodes  with  Full  memory  to  take  advantage  of  the 
dual  processors  available  in  each  node.  As  is  seen  in  Fig¬ 
ure  13(e),  the  performance  is  better  than  for  the  previous 
cases.  Finally,  Figure  13(f)  shows  the  combined  effect  of 
running  two  sets  of  filters  on  the  nodes  with  full  memory, 
and  adjusting  the  Partitioner  output  to  balance  the  workload 
across both sets of nodes. As expected, this configuration
performs  better  than  all  other  cases.  These  experimental  re¬ 
sults  clearly  show  that  application-level  workload  handling 
and  careful  placement  of  filters  can  deal  with  heterogeneity, 
which  can  have  a  significant  impact  on  performance.  Ques¬ 
tions  that  require  further  investigation  include  (1)  “can  we 
develop  cost  models  for  filters  and  for  the  application  per¬ 
formance  so  that  the  placement  of  filters  and  workload  han¬ 
dling  can  be  done  by  the  runtime  system,  with  little  interven¬ 
tion  from  the  user?”  and  (2)  “can  we  make  use  of  expos¬ 
ing  resource  requirements  and  communication  characteris¬ 
tics  to  develop  accurate  and  efficient  cost  models?”.  We  plan 
to  work  on  more  applications  and  different  configurations  to 
seek  answers  to  these  questions  in  future  work. 
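
The aggregate-memory figure quoted in the discussion above (350% more memory for the 1/8 case) follows from simple arithmetic, reproduced here as a small Python sketch. Memory is measured in units of one node's Full allocation. The function names are hypothetical, and `relative_tuple_share` rests on an assumption the text does not settle: that the roughly 10% reduction in tuples per halving of memory, used for Figure 13(d), compounds multiplicatively.

```python
def aggregate_memory(full_nodes: int, reduced_nodes: int, fraction: float) -> float:
    """Total memory, in units of one node's Full allocation."""
    return full_nodes * 1.0 + reduced_nodes * fraction


def extra_memory_percent(fraction: float, nodes: int = 8) -> float:
    """Percent more aggregate memory in the heterogeneous setup than in the base case."""
    half = nodes // 2
    hetero = aggregate_memory(half, half, fraction)   # half the nodes keep Full memory
    base = aggregate_memory(0, nodes, fraction)       # every node is reduced
    return (hetero / base - 1.0) * 100.0


def relative_tuple_share(halvings: int, reduction_per_halving: float = 0.10) -> float:
    """ASSUMPTION: the ~10% reduction compounds with each halving of memory."""
    return (1.0 - reduction_per_halving) ** halvings


print(extra_memory_percent(1 / 8))   # 350.0
print(relative_tuple_share(3))       # share for a 1/8-memory node relative to a Full node
```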

7.  Conclusion  and  Future  Work 

We  have  presented  a  framework,  called  filter-stream  pro¬ 
gramming,  for  developing  data-intensive  applications  in  a 
Grid  environment.  This  framework  represents  the  process¬ 
ing  in  an  application  as  a  set  of  processing  components, 
called filters. The goal is to constrain application components
to allow for location independence, and to expose
communication  characteristics  and  resource  requirements, 
thus  enabling  a  runtime  system  to  support  efficient  execu¬ 
tion  of  the  application.  We  have  described  a  prototype  run¬ 
time  infrastructure  to  execute  applications  using  the  filter- 
stream  programming  framework.  We  have  discussed  imple¬ 
mentations  of  two  data-intensive  applications  that  make  use 
of  our  filter-stream  framework,  and  presented  experimental 
results. 


Our  experimental  results  show  that  there  exists  a  delicate 
balance,  and  sometimes  subtle  interactions  with  heteroge¬ 
neous  resources,  that  can  have  a  large  impact  on  application 
performance.  We  plan  to  further  investigate  such  interac¬ 
tions  to  develop  cost  models  that  can  aid  in  decomposition 
of  applications  into  filters  and  placement  of  the  filters.  We 
also  are  in  the  process  of  implementing  other  applications  to 
use  the  filter-stream  programming  framework  from  applica¬ 
tion  areas  such  as  volume  visualization,  database  decision 
support,  and  image  processing. 

Figure 13. Execution time of sorting a 1GB dataset using 8 nodes. Two groups of four nodes: (1)
memory  usage  held  constant  (Full),  (2)  memory  usage  reduced  (1/X). 

Abstract

Enforcement of a high-level statement of security policy
may be difficult to discern when mapped through functional
requirements to a myriad of possible security services
and mechanisms in a highly complex, networked
environment. A method for articulating network security
functional requirements, and their fulfillment, is presented.
Using this method, security in a quality of service framework
is discussed in terms of “variant” security mechanisms
and dynamic security policies. For illustration, it is
shown how this method can be used to represent Quality of
Security Service (QoSS) in a network scheduler benefit
function.


1  Introduction 

Several  efforts  are  underway  to  develop  middleware 
systems  that  will  logically  combine  network  resources  to 
construct  a  “virtual”  computational  system  [4]  [7]  [8] 
[15]. These geographically distributed, heterogeneous
resources  are  expected  to  be  used  to  support  a  heteroge¬ 
neous  mix  of  applications.  Collections  of  tasks  with  dispar¬ 
ate  computation  requirements  will  need  to  be  efficiently 
scheduled  for  remote  execution.  Large  parallelized  compu¬ 
tations  found  in  fields  such  as  astrophysics  [14]  and  meteo¬ 
rology  will  require  allocation  of  perhaps  hundreds  of 
individual  processes  to  underlying  systems.  Multimedia 
applications, such as voice and video, will impose requirements
for low jitter, minimal packet losses, and isochronal
data  rates.  Adaptive  applications  will  need  information 
about  their  environment  so  they  can  adjust  to  changing 
conditions. 

User  acceptance  of  these  virtual  systems,  for  either 
commercial  or  military  applications,  will  depend,  in  part, 
upon  the  security,  adaptability,  and  user-responsiveness 
provided. Several of the projects engaged in building the
middleware  to  create  these  networks  are  pursuing  the  inte¬ 
gration  of  security  [6]  [10]  [23]  and  quality  of  service  [1] 
[17]  into  these  systems.  The  need  for  virtual  networked 
systems  to  both  adapt  to  varying  security  conditions,  and 
offer  the  user  a  range  of  security  choices  is  apparent. 

In  the  network  computing  context,  users  or  user  pro¬ 
grams  may  request  the  execution  of  “jobs,”  which  are 
scheduled  by  an  underlying  control  program  to  execute  on 
local  or  remote  computing  resources.  The  execution  of  the 
job  may  access  or  consume  a  variety  of  network  resources, 
such as: local I/O device bandwidth; internetwork bandwidth;
local and remote CPU time; and local, intermediate
(e.g., routing buffers), and remote storage. The resource
usages  may  be  temporary  or  persistent.  As  there  are  multi¬ 
ple  users  accessing  the  same  resources,  there  are  naturally 
various  allotment,  contention,  and  security  issues  regarding 
use  of  those  resources. 

The  body  of  rules  for  resolving  network  security  issues 
is  called  the  network  security  policy,  whereas  the  body  of 
rules for resolving network contention and allotment comprises
a network management policy (which is sometimes
taken  to  include  the  network  security  policy).  These  poli¬ 
cies  consist  of  broad  policy  jurisdictions,  such  as  schedul¬ 
ing,  routing,  access  control,  auditing,  and  authentication. 
Furthermore, these jurisdictions can be decomposed, typically,
into functional requirements, such as “users from
network domain A must not access site B,” and “user C
must  receive  a  certain  quality  of  service.”  The  network 
management  and  security  policies,  as  mapped  through  the 
functional  requirements,  may  be  manifested  in  mecha¬ 
nisms  throughout  the  network,  including:  host  computer 
operating  systems,  network  managers,  traffic  shapers, 
schedulers,  routers,  switches  and  combinations  thereof.  As 
these  mechanisms  are  distributed  and  are  often  obscurely 
related,  there  has  been  some  interest  in  the  ability  to 
express  and  quantify  the  level  of  support  for  security  policy 
and  Quality  of  Security  Service  (QoSS:  managing  security 
and  security  requests  as  a  responsive  “service”  for  which 
quantitative measurement of service “efficiency” is possible)
provided in networked systems.

The  purpose  of  this  paper  is  to  present  the  system  devel¬ 
oped  for  the  MSHN  resource  management  system  [8]  for 
describing  network  security  policy  functional  require¬ 
ments,  to  show  how  QoSS  parameters  and  mechanisms  can 
be  represented  in  such  a  system,  and  to  provide  an  example 
of  the  use  of  this  system.  The  remainder  of  this  paper  is 
organized  as  follows.  Section  2  discusses  a  “security  vec¬ 
tor”  for  quantifying  functional  support  of  network  security 
policy.  Section  3  describes  how  the  security  vector  can  be 
used  for  expressing  the  effects  of  QoSS  in  a  network¬ 
scheduling  benefit  function;  and  a  conclusion  follows  in 
Section  4. 

2  Network  Security  Vector 

A network security policy can be viewed as an n-dimensional
space of functional security requirements. We represent
this multidimensional space with a vector (S) of
security  components.  Each  component  (S.c)  specifies  a 
boolean  functional  requirement,  whereby  the  instantiation 
of  a  network  job  either  meets  (possibly  trivially)  or  does 
not  meet  each  of  the  requirements.  By  convention,  a  secu¬ 
rity  vector’s  components  are  ordered,  so  they  can  be  refer¬ 
enced  ordinally  (S.3)  or  symbolically  (S.c).  A  component 
may  indicate  positive  requirements  (e.g.,  communications 
via  node  n  must  use  encryption)  as  well  as  negative  con¬ 
straints  (e.g.,  users  from  subnet  s  may  not  use  node  n). 
Components can also be hierarchically grouped [22].
Requirements for a given security service may be represented
by one or more components (indicating a service
sub-vector), and a security service may utilize functions
and  requirements  of  other  services  and  their  components. 

Some jobs can produce output in different formats,
where  a  given  format  (e.g.,  high  resolution  video)  might  be 
more  resource  consumptive  than  another  format  (e.g.,  low 
resolution  video).  Formats  may  have  differing  security 
requirements,  even  within  the  same  job.  For  example,  a 
video-stream  format  may  require  less  packet  authentication 
[19], percentage-wise, than a series of fixed images based
on  the  same  data.  A  “quality  of  service”  scheduling  mecha¬ 
nism  might  choose  one  format  for  a  job  over  another, 
depending  on  varying  network  conditions  (e.g.,  traffic  con¬ 
gestion).  Further,  adaptive  applications  may  select  formats 
depending upon changing conditions. For example, IPSec
security association (SA) processing using ISAKMP under
IKE can permit complex security choices through an SA
payload, and the payload recipient may be given transform
choices regarding both the Authentication Header and the
Encapsulating Security Payload [13].


2.1  Notation 

The set of all jobs is represented by J. The set of all formats
is represented by I. The notation S_ij identifies a vector
containing the portions of S that are applicable to job j in
format i, and S_ij.c identifies a given component (c) of S_ij.
The relation of S to S_ij is clarified further, below. The following
are some informal examples of security-vector
components: 

• S.1: user access to resource is equal to read/write; based
on  table  t 

•  S.2:  %  of  packets  authenticated  >=  50,  <=  90;  inc  10 

•  S.3:  clearance  (user)  =  secrecy/integrity  (resource) 

•  S.4:  length  of  confidentiality  encryption  key  >=  64,  <= 
256;  inc  64 

•  S.5:  authentication  header  transform  in  {HMAC-MD5, 
HMAC-SHA} 

•  S.6:  packets  from  domain  A  to  domain  B  must  be 
encrypted 

•  S.7:  packets  from  domain  A  cannot  be  sent  through 
domain  C 

Here,  “inc  10”  indicates  that  the  range  from  50  through 
90  is  quantized  into  increments  of  10,  viz:  50,  60,  70,  80, 
90.  Later,  we  will  need  to  indicate  the  number  of  quantized 
steps in the component; to do this, one more notational element
is introduced, [S.c]. In the above examples, [S.1] = 1,
and [S.2] = 5.
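
The quantization rule can be made concrete with a short Python sketch. The helper names `quantized_values` and `num_steps` are hypothetical; only the `>= low, <= high; inc` reading of the component notation comes from the examples above.

```python
from typing import List


def quantized_values(low: int, high: int, inc: int) -> List[int]:
    """Enumerate the quantized steps of a range written as '>= low, <= high; inc'."""
    if inc <= 0 or high < low:
        raise ValueError("expected inc > 0 and low <= high")
    return list(range(low, high + 1, inc))


def num_steps(low: int, high: int, inc: int) -> int:
    """[S.c] for a variant component: the number of quantized steps."""
    return len(quantized_values(low, high, inc))


print(quantized_values(50, 90, 10))   # [50, 60, 70, 80, 90]
print(num_steps(50, 90, 10))          # 5, matching [S.2]
print(num_steps(64, 256, 64))         # 4 key lengths for S.4: 64, 128, 192, 256
```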

2.2  Variant  Security  Components 

When  [S.c]  >  1,  the  underlying  control  program  has  a 
range  within  which  it  may  allow  the  job  to  execute  with 
respect  to  the  policy  requirement.  We  refer  to  this  type  of 
policy,  and  component,  as  “variant.”  Security-variant  poli¬ 
cies  may  be  used  within  a  resource  management  context, 
for example, to effect adaptation to varying network conditions [18].
Also, if the policy mechanism is variant, the
control  program  may  offer  QoSS  choices  to  the  users  to 
indicate  their  preferences  with  respect  to  a  given  job  or 
jobs. Without variant mechanisms, neither security adaptability
by the underlying control program nor QoSS is
possible,  since  fixed  policy  mechanisms  do  not  allow  for 
changes  to  security  within  a  fixed  job/resource  environ¬ 
ment.  While  the  expression  S.c  may  contain  a  compound 
boolean statement (see Section 2.3), by convention it may
contain  only  one  variant  clause. 

2.3 Component Structure

For  use  in  the  examples  in  this  discussion,  a  component 
has  the  following  composition  (see  Table  1  for  details): 

• component ::= boolean-expression, variant-range-specifier; modifying-clause

• boolean-expression ::= boolean-statement [(or | and) boolean-statement]*

• boolean-statement ::= LHS boolean-operator RHS

Note  that  it  is  not  the  focus  here  to  elaborate  on  a  policy 
representation  language.  See  other  efforts  and  works  in 
progress  [2]  [3]  [5]  [16]. 

A  given  policy  component  has  a  value  which  is  a  bool¬ 
ean  expression.  This  component  may  also  have  an  instanti¬ 
ated  value  with  respect  to  a  specific  job  and  format,  which 
is either “true” or “false.” A component has a left hand side
(LHS),  which  is  the  item  that  is  being  tested;  of  course  the 
LHS  has  a  value  as  well  as  an  instantiated  value.  A  compo¬ 
nent  also  has  a  right  hand  side  (RHS),  which  is  what  the 
LHS  is  tested  against,  as  well  as  zero  or  more  modifying 
clauses.  Similarly  to  the  LHS,  the  RHS  may  have  a  value 
(or  values)  which  is  dependent  on  the  instantiation  of  the 
component. 
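
One way to picture the component structure is as a small record with an evaluation method. The Python sketch below is illustrative only: the class name `Component`, its field names, and the restriction to a single boolean statement with an optional upper bound are assumptions, not a policy representation language from the paper. The two instances mirror the examples S.1 and S.2.

```python
import operator
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class Component:
    lhs: str                          # the item being tested
    op: Callable[[Any, Any], bool]    # boolean operator
    rhs: Any                          # what the LHS is tested against
    upper: Optional[int] = None       # variant range specifier (e.g. <= 90), if any
    modifying_clause: str = ""        # e.g. "inc 10" or "based on table t"

    def instantiate(self, lhs_value: Any) -> bool:
        """Instantiated value of the component for a given instantiated LHS."""
        ok = self.op(lhs_value, self.rhs)
        if self.upper is not None:
            ok = ok and lhs_value <= self.upper
        return ok


s1 = Component("user access to resource r", operator.eq, "RW",
               modifying_clause="based on table t")
s2 = Component("% of packets authenticated", operator.ge, 50,
               upper=90, modifying_clause="inc 10")

print(s1.instantiate("W"))   # False: write-only access is not RW
print(s2.instantiate(70))    # True: 70 lies within [50, 90]
```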

2.4  Dynamic  Security  Policies 

With a dynamic security policy, the value of a vector's
components may depend on the network “mode” (e.g., normal,
impacted, emergency, etc.), where M is the set of all
modes. There is, conceptually, a separate vector for each
operational mode, represented as S^mode. Access to a predefined
set of alternate security policies allows their functional
requirements and implementation mechanisms to be
examined  with  respect  to  the  overall  policy  prior  to  being 
fielded,  rather  than  depending  on  ad  hoc  methods,  for 
example,  during  an  emergency. 

Initially,  every  component  of  S  has  the  same  value  in 
each  of  its  modes.  Ultimately,  components  may  be 
assigned  different  values,  depending  on  the  network  mode. 
For  example: 

• S^normal.a: % of packets authenticated >= 50, <= 90; inc 10

• S^impacted.a: % of packets authenticated >= 20, <= 50; inc 10

Note how [S.a] changes from 5 to 4 under the impacted mode.

• S^normal.b: user access to network node = granted; based on table t

• S^impacted.b: user access to network node = granted; based on table t, OR UID in set of administrators

• S^emergency.b: UID in set of {administrators, policymakers}

Or, for example, policy makers might decide that the
policy should remain in force regardless of network mode:

• S^normal.c = S^impacted.c = S^emergency.c: clearance (user) = classification (resource)


Table 1: Simple Component Elements

Element Name               Example S.1                                       Example S.2
-------------------------  ------------------------------------------------  -----------------------------------------------
Value                      user access to resource r = RW, based on table t  % of packets authenticated >= 50, <= 90; inc 10
Instantiated value         false                                             true
Value of LHS               user access to resource r                         % of packets authenticated
Instantiated value of LHS  W                                                 70
Boolean operator           =                                                 >=
Value of RHS               RW                                                50
Variant range specifier    none applicable                                   <= 90
Modifying clause           based on table t                                  inc 10

If a mode is not specified for a component (e.g., “S.a”),
normal  mode  is  assumed.  This  will  be  the  case  (i.e.,  the 
mode  is  unspecified)  for  the  remainder  of  this  discussion. 
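
A mode-dependent component can be modeled as a lookup keyed by network mode. The Python sketch below is a hypothetical illustration using the ranges given for component a. Two choices are assumptions: falling back to the normal-mode value when a mode has no entry of its own (in keeping with the statement that every component initially has the same value in each mode), and the names `POLICY_A`, `component_for_mode`, and `steps`.

```python
from typing import Dict, Tuple

RangeSpec = Tuple[int, int, int]   # (low, high, inc)

# Component "a" per mode, using the ranges from the examples in this section.
POLICY_A: Dict[str, RangeSpec] = {
    "normal": (50, 90, 10),
    "impacted": (20, 50, 10),
}


def component_for_mode(policy: Dict[str, RangeSpec], mode: str = "normal") -> RangeSpec:
    """Return the component value for a mode; unspecified mode means normal."""
    # ASSUMPTION: a mode without its own entry inherits the normal-mode value.
    return policy.get(mode, policy["normal"])


def steps(spec: RangeSpec) -> int:
    """[S.a]: number of quantized steps in the range."""
    low, high, inc = spec
    return (high - low) // inc + 1


print(steps(component_for_mode(POLICY_A)))               # 5
print(steps(component_for_mode(POLICY_A, "impacted")))   # 4
```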

2.5  Refinements  to  Security  Vector 

R is the set of resources {r_1, ..., r_n}. R_ij is the subset of R
utilized in executing job j in format i.

T_j is the requested completion time of job j.

Security policies may be expressed with respect to principals
(user, group or role, etc.), applications, data sets
(both destination and source), formats, etc., as well as
resources in R_ij.

The definition of S_ij is finally refined as follows: S_ij is a
vector that is an order-preserving projection of S, such that
a component c from S is in S_ij if and only if the value of c
depends on format i, job j, or any r in R_ij. The number of
components in a security vector S_ij is [S_ij].
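
The order-preserving projection can be sketched as a filter over the full vector. In this Python illustration the class `Comp`, its `depends_on` field, and the `job:`/`fmt:`/`res:` name prefixes are hypothetical devices for recording what a component's value depends on; the paper does not prescribe a data structure.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Comp:
    name: str
    depends_on: FrozenSet[str]   # jobs, formats, and resources the value depends on


def project(S: List[Comp], job: str, fmt: str, resources: FrozenSet[str]) -> List[Comp]:
    """Build S_ij: keep, in original order, components that depend on i, j, or any r in R_ij."""
    relevant = {job, fmt} | resources
    return [c for c in S if c.depends_on & relevant]


S = [
    Comp("S.1", frozenset({"res:r1"})),
    Comp("S.2", frozenset({"fmt:video"})),
    Comp("S.6", frozenset({"res:r9"})),
    Comp("S.7", frozenset({"job:j1"})),
]
S_ij = project(S, job="job:j1", fmt="fmt:video", resources=frozenset({"res:r1"}))
print([c.name for c in S_ij])   # ['S.1', 'S.2', 'S.7']
print(len(S_ij))                # [S_ij] = 3
```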

2.6  Summary  of  Security  Vector 

S  is  a  general  purpose  notational  system  suitable  for 
expressing  arbitrarily  complex  sets  of  network  security 
mechanisms.  S  can  express  variant  policies,  to  allow  secu¬ 
rity  expressions  of  quality  of  service  requests,  and  can  have 
dynamic  security  elements  to  accommodate  multiple  situa¬ 
tion-based  policies.  In  particular,  S  can  represent  both  (1) 
static  security  requirements  that  may  be  implemented  in  a 
system,  as  well  as  (2)  the  results  of  running  a  particular  job 
or  set  of  jobs  against  such  static  requirements.  The  latter 
usage  will  be  examined  in  the  next  section,  to  express 
QoSS  in  a  resource  management  system  benefit  function. 

3  Network-Scheduler  Benefit  Function 

As discussed above, various mechanisms exist for managing
contention for, and allotment of, distributed network
resources.  One  class  of  these  mechanisms  attempts  to  effi¬ 
ciently  schedule  the  execution  of  multiple  (possibly  simul¬ 
taneous)  jobs  on  multiple  distributed  computers  (e.g.,  the 
MSHN project [8] [23] [24] [11] [17]), where each job
requires  a  determinable  subset  of  the  resources.  Of  interest 
is  a  benefit  function  for  comparing  the  effectiveness  of 
such  job  scheduling  mechanisms  when  they  are  presented 
with  real  or  hypothetical  “data  sets”  of  jobs. 

Jobs  are  assigned  priorities  for  use  in  resolving  resource 
contention  and  allocation  issues.  In  some  systems,  a  job’s 
priority  may  depend  upon  the  particular  operating  mode  of 
the network [8]. Also, the different data formats of a multiple-format
job may have different preferences (e.g.,
assigned  by  a  user  or  “hard  wired”  as  part  of  the  applica¬ 
tion  or  job-scheduler  database),  and  different  levels  of 


resource usage [10] [12]. A network job scheduler should
receive  more  credit  in  the  benefit  function  for  scheduling 
high  priority  and  high  preference  jobs,  as  opposed  to  low 
priority  or  low  preference  jobs.  That  is  to  say,  a  scheduler 
is  intuitively  doing  a  better  job  if  important  jobs,  as  judged 
by  priority  and  preference,  receive  more  attention  than 
unimportant  jobs.  How  much  weight  the  priorities  and 
preferences  are  given  is  a  matter  of  network  scheduling 
policy. 

For  illustration,  we  introduce  a  simple  benefit  function, 
B, to measure how well a scheduler meets the goals of user
preference and system priorities (see [4], [12] and [21]
for other approaches). This function averages preference
(p) and priority (P) (the use of priority and preference in
measuring network effectiveness has been introduced for
the MSHN project [10]).


$$B = \frac{\sum_{j=1}^{n} \sum_{i=1}^{m_j} X_{ij}\,(P_j + p_{ij})}{2n}$$

Where the characteristic function X is defined for i, j as:
X_ij = 1 if format i was successfully delivered to job j
within time T_j, else 0

and at most one format is completed per job:

$$\forall j \in J: \quad \sum_{i=1}^{m_j} X_{ij} \le 1$$

Jobs and formats are defined as above.

P_j is the priority of job j

0 <= P_j <= 1

The formats for a job are assigned preferences (p) by
the user such that:

0 <= p <= 1

m_j is the number of (format, preference) pairs
assigned for job j

p_ij is the preference the user has assigned to format i,
job j

the preferences for a job add up to 1:

$$\sum_{i=1}^{m_j} p_{ij} = 1$$

This  approach  assumes  that  users  will  assign  preference 
values  that  correspond  to  resource  usage,  since  we  want  the 
benefit function to indicate a higher value when the scheduler
succeeds in scheduling “harder” jobs [12].
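
The simple benefit function can be computed directly from its definition. The Python sketch below is a minimal illustration: the function name `benefit` and its argument layout are hypothetical, and n is taken to be the number of jobs in the data set, which the text implies but does not state. Recording a single delivered format index per job (or `None`) enforces the constraint that at most one format is completed per job.

```python
from typing import List, Optional


def benefit(priorities: List[float],
            preferences: List[List[float]],
            delivered: List[Optional[int]]) -> float:
    """B = sum over jobs j and formats i of X_ij * (P_j + p_ij), divided by 2n.

    delivered[j] is the index of the one format completed within T_j for job j,
    or None if no format was delivered in time (all X_ij = 0 for that job).
    """
    n = len(priorities)
    total = 0.0
    for P_j, p_j, i in zip(priorities, preferences, delivered):
        if abs(sum(p_j) - 1.0) > 1e-9:
            raise ValueError("preferences for a job must add up to 1")
        if i is not None:
            total += P_j + p_j[i]
    return total / (2 * n)


# Two jobs: the first delivers its preferred format, the second misses its deadline.
print(benefit([1.0, 0.5], [[0.75, 0.25], [0.5, 0.5]], [0, None]))   # 0.4375
```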

3.1 Adding Security to the Benefit Function

We  wish  the  benefit  function  to  reflect  the  effectiveness 
and  restrictions  of  the  security  policy.  First,  we  define  the 
characteristic  security  function  Z,  for  i  and  j: 

Z_ij = 1 if the instantiated values of all components in S_ij
are true, else 0

The  numerator  of  the  benefit  function  is  multiplied  by  Z, 
so  that  no  credit  is  given  for  jobs  that  fail  to  meet  the  secu¬ 
rity  requirements: 


$$B = \frac{\sum_{i=1}^{n} \sum_{j=1}^{m_i} Z_{ij}\,(W_i + P_{ij})}{2n}$$

Now, for variant components, we wish to be able to give
less credit to the scheduler for fulfilling less “difficult”
security requirements. One algorithm for expressing this is
for each instantiated component (c) in $S_{ij}$ to be assigned a
security completion token (g), where $0 \le g \le 1$. $g_c$ will
indicate the completion token corresponding to component
S.c. $g_c$ is defined to be the “percentage” of [S.c] met or
exceeded by the instantiated value of the component’s
LHS (notated as S.c''):

$$g_c = \frac{S.c''}{[S.c]}$$

To illustrate the calculation of $g_1$, for component S.1:

S.1: % of packets authenticated >= 50, <= 90; inc 10
[S.1] = 5 (the number of quanta in S.1), S.1'' = 3 (the
job achieves the 3rd quantum (70))
$g_1 = 3/5 = 0.6$

For invariant components, g = 1 or g = 0. A token (g)
whose value is 0 represents a job “failing” the component’s
security policy. Recall that Z will be 0 when the job/format
fails to meet the requirement of any security component,
meaning that the function returns no benefit value for that
job/format. We introduce a function (A) which averages the
tokens of a job:

$$A_{ij} = \frac{g_1 + g_2 + \cdots + g_n}{n}$$

where $n = [S_{ij}]$, the number of components in $S_{ij}$,
and $0 \le A_{ij} \le 1$.

Averages, such as A, over many different elements can
tend to minimize the difference that is seen between different
data sets. Therefore, we weight the tokens (g) assigned
to individual security components to give more credence to
components that are “more important” than others, e.g.,
reflecting network management policies. Each $g_n$ has a
corresponding integer weight ($w_n$), > 0. So $A_{ij}$ becomes:

$$A_{ij} = \frac{g_1 w_1 + g_2 w_2 + \cdots + g_n w_n}{w_1 + w_2 + \cdots + w_n}$$

again $0 \le A_{ij} \le 1$.

In the final expression of the network benefit function, A
is added to the numerator, providing an average of security,
priority and preference.

$$B = \frac{\sum_{i=1}^{n} \sum_{j=1}^{m_i} Z_{ij}\,(W_i + P_{ij} + A_{ij})}{3n}$$

$0 \le B \le 1$, where 1 indicates the maximum
scheduling effectiveness.
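
The calculation described in this section can be traced with a short Python sketch. The function names and the tuple layout of a job are illustrative assumptions, not part of MSHN. The sketch also assumes that each job is scheduled in exactly one format, so the inner sum over formats reduces to that single format, and that priority and preference are already normalised to the range 0 to 1.

```python
def completion_token(achieved_quantum, total_quanta):
    """g_c = S.c'' / [S.c]: share of a component's quanta met or exceeded."""
    if not 0 <= achieved_quantum <= total_quanta:
        raise ValueError("achieved quantum must lie between 0 and total_quanta")
    return achieved_quantum / total_quanta


def weighted_security_average(tokens, weights):
    """A_ij = sum(g_c * w_c) / sum(w_c), with positive integer weights."""
    if not tokens or len(tokens) != len(weights):
        raise ValueError("need one weight per token")
    return sum(g * w for g, w in zip(tokens, weights)) / sum(weights)


def benefit(jobs):
    """B = sum(Z * (W + P + A)) / 3n over n jobs.

    Each job is (priority W, preference P, tokens, weights) for the single
    format in which it was scheduled (an assumption of this sketch).
    """
    total = 0.0
    for priority, preference, tokens, weights in jobs:
        # Z is 0 as soon as any component token is 0 (policy not met).
        z = 1 if all(g > 0 for g in tokens) else 0
        a = weighted_security_average(tokens, weights)
        total += z * (priority + preference + a)
    return total / (3 * len(jobs))


# Component S.1 from the text: 5 quanta, the job reaches the 3rd.
g1 = completion_token(3, 5)
jobs = [
    (1.0, 0.5, [g1, 1.0], [1, 3]),   # meets every component
    (0.5, 1.0, [0.0, 1.0], [1, 1]),  # fails one component, so Z = 0
]
print(benefit(jobs))
```

A job that fails any component contributes nothing to the numerator but still counts in the denominator, which is how the scheduler loses credit for it.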

3.2  Applicability 

This  technique  for  quantifying  the  variant  security 
instantiated  by  a  resource  management  system  is  being 
used  in  the  MSHN  project  as  a  factor  in  representing  the 
effectiveness  of  its  resource  assignments  [10]  .  In  the 
MSHN  design,  the  security  requirements  of  network 
resources  (represented  by  S)  are  stored  in  a  Resource 
Requirements  Database.  This  database  is  consulted  during 
the  resource  scheduling  phase  to  effectively  match  jobs  to 
resources.  We  expect  that  this  measurement  technique 
could  also  be  applied  to  other  resource  management  sys¬ 
tems,  such  as  Condor  [15]  and  Globus  [7] . 

While  different  schedulers  could  be  compared  with 
respect  to  the  individual  components  of  B ,  a  summary 
function  such  as  B  would  be  useful  to  automate  and  nor¬ 
malize  the  comparison  process.  Additionally,  we  expect 
that  the  security  component  (viz,  A)  in  an  operational  sys¬ 
tem  would  be  complex  enough  to  evade  effective  manual 
analysis. 

4  Discussion  and  Conclusion 

A  security  vector  has  been  presented  for  describing 
functional  requirements  of  network  security  policies.  It  has 
been  shown  that  this  vector  can  be  used  for  representing 
security  with  respect  to  both  quality  of  service  and  a  net¬ 
work  scheduler  benefit  function. 

We  are  involved  in  ongoing  work  to  organize  the  secu¬ 
rity  vector  into  a  “normal  form”  with  sub- vectors  or  hierar¬ 
chies  corresponding  to  security  policy  jurisdictions  (such 
as:  access  control,  auditing,  and  authentication)  and  to 
incorporate  a  costing  methodology  for  security  compo¬ 
nents,  such  as  can  be  provided  to  a  resource  management 
system  [9]  .  We  are  working  to  develop  a  means  of  adjust¬ 
ing  the  preference  expression  with  a  notion  of  the  corre¬ 
sponding  resource  usage  [12]  .  We  are  considering  how  to 
expand the security benefit function (A) to reflect user quality
of security service choices within the range allowed by
variant  security  components,  and  to  reflect  performance 
implications  of  redundant  security  mechanisms. 

The  organizational  security  policy  [20]  governing  the 
network  may  allow  individuals  or  principals  representing 
them  to  override  rules  represented  by  invariant  security 
vector  components.  For  example,  a  military  commander 
might  decide  to  forgo  cryptographic  secrecy  mechanisms 
for  a  job  in  an  emergency  (e.g.,  to  improve  network  perfor¬ 
mance),  even  though  the  system  has  not  been  configured 
with “dynamic” or “variant” security mechanisms, as
defined  herein.  From  the  perspective  of  the  security  vector 
S  and  the  benefit  function,  this  is  a  change  to  or  violation 
of  the  computer  security  policy.  It  is  recommended  that  this 
type  of  policy  change  be  audited. 
Abstract

Gardens  is  an  integrated  programming  language  and 
system  designed  to  support  parallel  computing  across  non- 
dedicated  cluster  computers,  in  particular  networks  of  PCs. 
To  utilise  non-dedicated  machines  a  program  must  adapt 
to  those  currently  available.  In  Gardens  this  is  realised 
by  over  decomposing  a  program  into  more  tasks  than  pro¬ 
cessors,  and  migrating  tasks  to  implement  adaptation.  To 
be  effective  this  requires  efficient  task  migration .  Fur¬ 
thermore,  typically  non-dedicated  clusters  contain  different 
machines  hence  heterogeneous  task  migration  is  required. 
Gardens  supports  efficient  task  migration  between  hetero¬ 
geneous  machines  via  meta-information  which  completely 
describes  a  task's  state.  By  identifying  different  degrees  of 
heterogeneity  and  different  kinds  of  tasks,  we  are  able  to 
optimise  task  migration.  The  main  contribution  of  this  pa¬ 
per  is  to  show  how  heterogeneous  task  migration  may  be 
optimised. 


1.  Introduction 

In  the  aggregate,  networks  of  workstations  represent  a 
huge  and  cheap  unused  computing  resource.  By  their  very 
nature  such  non-dedicated  cluster  computers  are  dynamic. 
The  workstations  available  to  a  computation  will  typically 
change  during  the  execution  of  a  program  as  workstation 
users  come  and  go.  Thus  programs  must  adapt  to  the  chang¬ 
ing  availability  of  workstations. 

The  Gardens  system  [28]  is  an  integrated  programming 
language  and  system  targeted  at  non-dedicated  cluster  com¬ 
puters.  The  goals  of  Gardens  are:  adaptation,  safety,  ab¬ 
straction  and  performance  (ASAP!).  These  are  realised  in 
part  by  a  modern  object  oriented  programming  language, 
Mianjin  [27],  a  derivative  of  Pascal.  The  Gardens  system 
and Mianjin programming language are custom designed
and built; thus we have complete control over both of these.

Gardens  utilises  task  migration  to  realise  adaptation.  A 
program  is  over  decomposed  into  more  tasks  than  proces¬ 
sors  and  tasks  are  migrated  in  response  to  changing  work¬ 
station  loads.  This  adaptation  is  transparent  to  the  program¬ 
mer.  Typically,  workstation  networks  comprise  a  collec¬ 
tion  of  different  machines.  Thus  efficient  use  of  such  ma¬ 
chines  entails  heterogeneous  computing  and  heterogeneous 
task  migration,  the  subject  of  this  paper.  Task  migration  is 
integrated  into  both  Gardens  and  the  Mianjin  compiler. 

Tasks communicate via a virtual shared object space.
Tasks may reference objects belonging to other tasks; to
communicate, they invoke methods on such remote objects.
This is the only way tasks may communicate; tasks cannot
otherwise share data.

The  main  contribution  of  this  paper  is  to  show  how  effi¬ 
cient  heterogeneous  task  migration  may  be  achieved,  to  re¬ 
alise  adaptive  utilisation  of  workstation  clusters;  however,  it 
should  be  noted  that  there  are  other  uses  for  task  migration 
e.g.  to  implement  automatic  fault  tolerance  through  migrat¬ 
ing  tasks  to  disk. 

The  next  section  summarises  our  techniques  for  achiev¬ 
ing  heterogeneous  task  migration.  Section  3  presents  a  more 
detailed  look  at  the  implementation.  Some  performance  fig¬ 
ures  are  reported  in  Section  4.  Section  5  presents  related 
work,  and  the  final  section  discusses  the  work  and  future 
directions. 

2.  Task  migration 

To  implement  task  migration  Gardens  uses  meta¬ 
information  which  fully  describes  a  task’s  state.  This  meta¬ 
information  is  generated  by  the  Gardens  Mianjin  compiler. 
The  meta-information  is  similar  to  that  available  in  Java, 
although  in  addition  to  heap  objects  our  meta-information 
also describes stack frames. A prerequisite for task migration
is a safe language. For example, we cannot allow a
pointer to masquerade as an integer or vice versa, since
integers and pointers may require different translation under
task migration (see Section 3).

0-7695-0556-2/00 $10.00 © 2000 IEEE

Task  migration  is  only  supported  at  predetermined  call 
points  in  the  program.  These  migration  points  may  be  man¬ 
ually  inserted  by  the  programmer  or  automatically  by  the 
compiler.  At  these  points  the  compiler  generates  the  addi¬ 
tional  code  and  meta-information  to  support  task  migration. 
This  can  support  both  preemptive  and  non-preemptive  task 
migration.  Our  current  compiler  does  not  support  optimised 
code,  although  this  is  currently  under  investigation.  To  sim¬ 
plify  migration  we  arrange  for  data  structures  to  have  com¬ 
mon  alignments  and  sizes  across  all  platforms.  We  can  do 
this  since  we  have  a  custom  system,  language  and  compiler. 
A  foreign  language  interface  mechanism  supports  interop¬ 
erability,  but  we  do  not  support  task  migration  within  such 
code. 

Meta-information is used to recover a task’s state. A
task’s state comprises its stack, heap and global variables;
registers and the PC are flushed to the stack. We do not
migrate OS process state; our goal is to handle such state
within wrapper libraries that can be migrated. The stack
and heap are similar in that both can be viewed as collections
of tagged records: the heap contains tagged objects and
the stack contains tagged activation records. All objects in
stack frames are statically known at compile time. When
required, a task’s state is transformed into a state suitable for
the target machine. In general this is done lazily: the task
may initially be saved to stable storage, in which case its
destination is unknown and the information needed for
transformation (e.g. stack and heap base addresses) only
becomes available at task load time.

To  make  task  migration  efficient  we  use  optimisations 
based  on  different  kinds  of  tasks  and  different  degrees  of 
platform  heterogeneity.  This  is  described  in  the  following 
sections. 

2.1.  Different  kinds  of  tasks 

There  are  three  kinds  of  tasks  in  Gardens  which  have 
different  migration  requirements.  These  different  kinds  of 
tasks  can  be  distinguished  by  the  runtime  system. 

Seed  tasks  are  newly  created  tasks  which  have  never  been 
run.  They  comprise  just  the  initial  data  passed  in  a 
create  task  operation,  they  have  no  stack  nor  heap  and 
hence  are  trivial  to  migrate.  Seed  tasks  are  stored  in  a 
separate  structure  from  other  tasks  until  they  are  run, 
and  hence  are  easily  distinguished  from  other  tasks. 

Stackless tasks have no stack. They correspond to an inverted
programming style as often used in event driven
programming where control must periodically return
to an event loop. Such tasks require only heap migration,
since their stacks are empty, which can be considerably
simpler than full task migration; some early
work on this was reported in [29]. A stackless task
may also be a task that has completed its main thread
of execution but has “actions” to perform or objects in
its heap which are referenced by other tasks.

Full  tasks  have  stacks  and  heaps  both  of  which  must  be  mi¬ 
grated.  These  are  the  most  expensive  tasks  to  migrate. 

2.2.  Degrees  of  heterogeneity 

There  is  a  spectrum  of  degrees  of  system  heterogeneity: 

0.  Same  architecture,  statically  linked  code,  same  heap 
and  stack  base  addresses:  a  completely  homogeneous 
platform.  For  such  platforms  no  state  transformation  is 
required  and  migration  corresponds  to  a  straight  mem¬ 
ory  copy  from  one  machine  to  another.  However  in 
practice  few  modern  platforms  are  this  simple. 

1. Same platform, but different base addresses (e.g. due
to  dynamic  linking  and  loading):  since  all  structures 
have  common  sizes  and  alignments,  and  all  stack 
frames  have  the  same  representation,  task  migration 
only  requires  stack  and  heap  pointers  to  be  adjusted 
to  deal  with  new  stack  and  heap  base  offsets.  This  re¬ 
quires  meta-information  to  locate  all  pointers  in  stack 
frames  and  heap  objects. 

2.  Different  architecture,  but  same  word  size:  for  heaps 
this  requires  stack  and  heap  pointer  adjustments  as 
described  above,  endian  adjustments,  code  and  data 
pointer  translations  and  type  representation  adjust¬ 
ments  (e.g.  for  floating  point  types).  In  the  case  of 
migrating  a  stack,  the  stack  must  be  rebuilt  with  dif¬ 
ferent  activation  record  conventions,  e.g.  stack  mark 
information,  register  window  flushing  for  SPARC  etc. 
This  can  be  very  expensive  to  perform. 

3.  Different  architecture  and  different  word  size:  at 
present  we  do  not  address  this  level  of  heterogeneity. 
Note,  some  64  bit  processors  are  capable  of  running  32 
bit  code  which  may  prove  useful. 
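
The way task kind and degree of heterogeneity jointly select the work to be done can be sketched as a small planning function. This is a hypothetical illustration in Python, not Gardens code: the names `TaskKind` and `migration_steps` and the step labels are assumptions. Seed tasks have neither stack nor heap, so the sketch lists no heap or stack steps for them, and the task descriptor is left out.

```python
from enum import Enum


class TaskKind(Enum):
    SEED = "seed"
    STACKLESS = "stackless"
    FULL = "full"


def migration_steps(kind, degree, same_endian=True):
    """List the heap and stack transformations a migration would need."""
    if degree not in (0, 1, 2):
        raise NotImplementedError("degree 3 (different word size) is not addressed")
    steps = []
    if degree == 0 or kind is TaskKind.SEED:
        return steps  # straight copy, or nothing on the heap or stack to fix
    regions = ["heap"] if kind is TaskKind.STACKLESS else ["heap", "stack"]
    for region in regions:
        steps.append(f"{region}: rebase pointers")
        if degree == 2:
            if not same_endian:
                steps.append(f"{region}: swap byte order")
            steps.append(f"{region}: translate code/data pointers")
    if degree == 2 and kind is TaskKind.FULL:
        steps.append("stack: rebuild frames for destination conventions")
    return steps


for step in migration_steps(TaskKind.FULL, 2, same_endian=False):
    print(step)
```

The cheaper combinations return shorter lists, which is the optimisation opportunity the two classifications are meant to expose.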

3.  Implementation 
3.1.  Meta-information 

All  objects  (records  and  arrays)  in  Gardens  have  an  iden¬ 
tifying  “tag”  located  in  the  two  words  before  the  logical  start 
of  the  object  in  memory.  One  word  is  used  for  garbage  col¬ 
lection  purposes  while  the  second  points  to  the  object’s  type 
descriptor;  these  type  descriptors  serve  two  purposes. 
Firstly, a type descriptor holds typical runtime information
such as dimension and element size for arrays and
method,  ancestor  and  pointer  tables  for  records  to  enable 
construction  and  polymorphism.  Secondly,  the  type  de¬ 
scriptors  contain  links  to  complete  meta-information  gen¬ 
erated  by  the  Gardens  compiler  to  aid  task  migration.  This 
meta-information  maps  out  the  size,  location  and  type  of 
fields  or  elements  of  the  type  in  question.  This  allows  the 
traversal  of  all  objects  at  runtime.  Furthermore,  the  meta¬ 
information  contains  a  link  to  a  descriptor  for  the  type’s 
module  as  well  as  a  per-module  index  that  is  assigned  to  the 
type  at  compile  time.  From  this,  a  unique  module  ID/per- 
module  index  pair  may  be  obtained  to  identify  the  type 
across  heterogeneous  platforms. 

In  addition  to  the  meta-information  for  types,  meta¬ 
information  for  procedures  is  generated  as  well.  Procedural 
meta-information  maps  out  procedure  entry  addresses,  lo¬ 
cal  variable  and  parameter  information  (number,  type  and 
frame  offset),  possible  frame  offset  location  of  saved  dis¬ 
play  values,  a  module  descriptor  link  and  per-module  index 
similar  to  those  described  for  types.  This  information  al¬ 
lows  for  traversal  of  a  stack  frame,  given  the  location  of  a 
frame  via  a  frame  pointer  and  for  unique  identification  for 
procedures,  in  function  pointers  or  as  instances  in  the  stack, 
across  platforms. 

Finally,  the  module  descriptor  contains  compilation  and 
time  stamps  as  well  as  three  tables  mapping  per-module  in¬ 
dexes  to  concrete  addresses.  The  first  two  tables  correspond 
to  the  type  and  procedure  indexes  while  the  third  table  maps 
potential  migration  points. 

3.2.  Heap  migration 

Each heap segment in Gardens comprises two logical
parts: the contiguous memory in which objects are allo¬
cated  and  the  runtime  information  for  managing  that  mem¬ 
ory.  Having  designed  and  implemented  the  Gardens  com¬ 
piler  and  runtime  system  has  allowed  us  to  ensure  that: 

1.  Objects  of  identical  type  are  aligned  identically  across 
platforms. 

2.  Heap  segment  structure  is  identical  across  platforms. 

3.  Heap  runtime  information  is  logically  identical  across 
platforms. 

This makes heap migration relatively simple; all that is
required is a few changes to the heap segment’s representation.
Furthermore, since all hosts in a Gardens environment
have complete information as to the characteristics
(architecture and operating system) of the other hosts, these
representation changes may be made directly by the source or
destination host; packing to and unpacking from an
intermediate representation is avoided.


Representation  changes  fall  into  three  categories: 

1.  Pointer  rebasing 

2.  Endian  adjustment 

3.  Code/Data  segment  address  translation 

The  current  platforms  targeted  by  Gardens  have  similar 
representation  for  types  (floating  points,  booleans,  etc.)  so 
type  adjustments  are  currently  not  necessary. 

Pointer  rebasing  is  necessary  when  migrating  between 
hosts  with  a  degree  of  heterogeneity  of  (1)  and  (2)  and  in¬ 
volves  traversal  of  all  pointer  fields  within  objects  in  the 
heap  segment  and  adjusting  any  non-null  pointers  by  an  ap¬ 
propriate  heap  offset.  Pointer  rebasing  is  performed  by  the 
source  host  and,  in  the  case  of  objects  in  the  heap,  requires 
only  the  pointer  table  found  in  the  type  descriptor. 
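
Pointer rebasing as described above can be sketched in a few lines of Python. The heap is modelled as a list of machine words and each object as a start index plus the word offsets of its pointer fields; this layout, and the name `rebase_pointers`, are assumptions made for illustration rather than the Gardens data structures.

```python
def rebase_pointers(heap_words, objects, old_base, new_base):
    """Return a copy of the heap with every non-null pointer shifted.

    heap_words: list of integer words.
    objects: list of (start_index, pointer_offsets), where pointer_offsets
             plays the role of the pointer table in the type descriptor.
    """
    delta = new_base - old_base
    rebased = list(heap_words)
    for start, pointer_offsets in objects:
        for offset in pointer_offsets:
            value = rebased[start + offset]
            if value != 0:  # null pointers stay null
                rebased[start + offset] = value + delta
    return rebased


heap = [7, 0x1008, 0, 42, 0x1000]
objects = [(0, [1, 2]), (3, [1])]
print([hex(w) for w in rebase_pointers(heap, objects, 0x1000, 0x5000)])
```

Only the words named by a pointer table are touched, so integer fields such as 7 and 42 pass through unchanged.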

Endian adjustment requires a full traversal of all objects
in the heap segment, performing byte swapping on every
multi-byte field. This is only required in degree (2) cases
in which the machines differ in endianness. Endian swapping
is generally performed by the source host.
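
A minimal Python sketch of field-wise byte swapping follows. It assumes the field sizes of an object are available as a list (standing in for the compiler-generated meta-information); the function name `swap_fields` is hypothetical.

```python
import struct


def swap_fields(raw, field_sizes):
    """Reverse the byte order of each field in a packed object image."""
    if sum(field_sizes) != len(raw):
        raise ValueError("field sizes do not cover the object")
    swapped = bytearray()
    position = 0
    for size in field_sizes:
        swapped += raw[position:position + size][::-1]
        position += size
    return bytes(swapped)


# A 4-byte and a 2-byte field laid out little-endian, converted to big-endian.
little = struct.pack("<IH", 0x11223344, 0x5566)
big = swap_fields(little, [4, 2])
assert big == struct.pack(">IH", 0x11223344, 0x5566)
```

Swapping must respect field boundaries, which is why the traversal needs type information rather than a blind reversal of the whole segment.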

Code/data  segment  address  translation  is  also  only  re¬ 
quired  in  degree  (2)  cases.  Since  the  layout  of  code  and 
data  segments  will  not  be  identical  across  heterogeneous 
platforms,  pointers  into  the  code  or  data  segments  cannot 
be  simply  “rebased”.  In  Gardens,  however,  the  only  can¬ 
didates  for  code  and  data  segment  pointers  in  the  heap  are 
procedure  variables  and  type  descriptor  addresses  present  in 
each  object  tag.  These  are  replaced  with  procedure  and  type 
module  ID/per-module  index  identifiers  on  the  source  side 
and  replaced  with  the  host  specific  address  on  the  destina¬ 
tion  side. 
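
The two-sided replacement can be illustrated with plain dictionaries. In this Python sketch each host holds a table from (module ID, per-module index) to a local address, standing in for the tables of the module descriptor; the module name, addresses and function names are made up for the example.

```python
def to_portable(address, source_table):
    """Source side: replace a host-specific address by its identifier."""
    reverse = {addr: ident for ident, addr in source_table.items()}
    return reverse[address]


def to_local(identifier, destination_table):
    """Destination side: replace an identifier by the local address."""
    return destination_table[identifier]


source_table = {("Lists", 0): 0x08049000, ("Lists", 1): 0x08049040}
destination_table = {("Lists", 0): 0x00021000, ("Lists", 1): 0x00021080}

identifier = to_portable(0x08049040, source_table)
print(identifier, hex(to_local(identifier, destination_table)))
```

Because the identifier is independent of any host's layout, neither side needs to know where the other placed its code and data segments.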

Heap  migration  in  Gardens  thus  breaks  down  into  per¬ 
forming  the  above  transformations  to  the  objects  in  the  heap 
segment  and  to  the  runtime  information  describing  the  heap 
itself. 

The  objects  in  the  heap  segment  may  be  located  by  scan¬ 
ning  a  heap  object  bitmap  that  is  maintained  by  the  run¬ 
time  system  for  memory  management  and  garbage  collec¬ 
tion.  The  type  and  meta-information  of  each  object  is  then 
obtained  by  inspecting  the  object’s  tag  allowing  the  object 
to  be  traversed  and  transformed  as  necessary. 
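
Locating objects from a bitmap might look like the following Python sketch. The bitmap format is not given in the text, so the sketch assumes one bit per heap word, with a set bit marking the first word of an object and bits numbered from the least significant bit of each byte.

```python
def object_starts(bitmap):
    """Return the word indices at which objects begin.

    bitmap: bytes, one bit per heap word (assumed layout, see above).
    """
    starts = []
    for byte_index, byte in enumerate(bitmap):
        for bit in range(8):
            if byte & (1 << bit):
                starts.append(byte_index * 8 + bit)
    return starts


print(object_starts(bytes([0b00000101, 0b10000000])))
```

Each returned index would then be used to read the object's tag and, through it, the type descriptor that drives the traversal.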

The  runtime  information  for  the  heap  consists  of  the  por¬ 
tion  of  the  heap  object  bitmap  relevant  to  the  heap  segment, 
a  heap  descriptor  located  at  the  start  of  each  heap  segment 
and  the  list  of  free  blocks  for  the  heap.  The  heap  object 
bitmap  contains  no  pointers  and  only  needs  endian  adjust¬ 
ments;  however,  the  heap  object  bitmap  remains  in  use  on 
the  source  host  after  heap  migration  has  occurred  (to  mark 
objects  and  heap  segments  as  remote).  Therefore,  endian 
adjustments  on  the  heap  object  bitmap  are  performed  by  the 
destination  host  if  necessary.  The  heap  descriptor  and  free 
block list are indistinguishable from other objects and are
transformed  correspondingly. 

3.3.  Stack  migration 

Stack segments in Gardens comprise a runtime stack
and  a  task  descriptor.  The  task  descriptor  stores  stack  and 
context  information  along  with  some  programmer  definable 
task  property  objects.  As  with  heap  descriptors,  transforma¬ 
tion  of  the  task  descriptor  is  straightforward.  The  method  of 
runtime  stack  transformation,  however,  depends  upon  the 
degree  of  heterogeneity  between  hosts. 

Degree  (0),  of  course,  requires  no  modifications  to  the 
stack. 

Degree  (1)  requires  only  pointer  rebasing  as  in  degree 
(1)  heap  migration.  Candidates  for  pointer  rebasing  in  stack 
migration  are  no  longer  just  pointer  fields  in  structures  but 
pointers  in  local  variables  and  parameters,  variable  parame¬ 
ters,  display  pointers  saved  in  stack  frames  and  frame  point¬ 
ers  themselves.  The  stack  is  traversed  using  the  instruction 
pointer  (or  return  address)  for  each  stack  frame  to  obtain  the 
corresponding  procedure’s  meta-information. 

For  degree  (2)  all  three  representation  transformations 
described  above  need  be  performed.  In  addition  to  this,  the 
layout  of  each  of  the  stack  frames  needs  to  be  restructured 
to  match  that  of  the  destination  host.  To  achieve  this,  the 
source  host  traverses  the  stack  and  deconstructs  the  stack 
into  an  abstract  stack  while  performing  the  representation 
transformations.  A  list  of  all  variable  parameters  that  point 
into  the  stack  and  the  address  into  the  abstract  stack  at  which 
they  point  is  also  constructed  by  the  host. 

The  abstract  stack  is  similar  to  the  concrete  stack  in  some 
ways.  Each  concrete  frame  has  a  corresponding  abstract 
frame  and  each  abstract  frame  has  a  parameter  section,  stack 
mark,  local  variable  section  and  workspace  (for  value  open 
arrays  and  value  reference  records).  However,  the  abstract 
stack  format  does  differ  from  concrete  stacks  in  the  follow¬ 
ing  manners: 

• Abstract frames are in opposite order to those in the
concrete stack, with the abstract frame pointers referring
to the following frame rather than the preceding
frame.

• Along with an abstract frame pointer, each abstract
stack mark contains a module ID/per-module index
identifier for the return address and a similar identifier
for the relevant procedure.

•  Parameters  and  local  variables  are  stored  in  order  from 
the  abstract  stack  mark  as  frame  offsets  of  parameters 
and  local  variables  differ  across  platforms. 

Once the destination host has received the abstract stack,
it rebuilds a stack specific to its architecture using a novel
approach. For each stack frame, the parameters are first
loaded  from  the  abstract  stack  (into  the  concrete  stack  or 
the  parameter  registers).  A  context  switch  is  performed  to 
the  new  stack  and  a  dummy  procedure  prologue  is  called  for 
the  appropriate  procedure.  This  allocates  appropriate  stack 
space,  updates  the  display  vector  and  copies  any  value  ar¬ 
rays  into  the  appropriate  position  in  the  workspace.  Context 
is  switched  back  to  the  original  stack  and  the  correct  re¬ 
turn  address  is  inserted  into  the  newly  allocated  stack  frame. 
Local  variables  are  copied  to  their  respective  positions  and 
code/data  segment  translations  are  performed.  Finally,  the 
variable  parameter  list  is  checked  to  see  if  memory  pointed 
to  by  a  variable  parameter  down  the  stack  has  been  copied 
to  the  concrete  stack.  If  so,  the  variable  parameter  value 
is  adjusted  to  reflect  the  change.  The  stack  frame  is  then 
complete. 
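
The round trip through an abstract stack can be sketched very roughly in Python. This is a toy model under stated assumptions: frames are dictionaries, the concrete stack is listed newest frame first, and the abstract stack is emitted oldest frame first so that the destination can push each callee on top of its caller. Context switching, dummy prologues, display vectors and variable parameter fix-ups are all omitted, and every name here is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AbstractFrame:
    procedure_id: tuple  # (module ID, per-module index) of the procedure
    return_id: tuple     # same kind of identifier for the return address
    parameters: list = field(default_factory=list)
    local_variables: list = field(default_factory=list)


def deconstruct(concrete_frames):
    """Source side: turn concrete frames (newest first) into abstract
    frames in the opposite order (oldest first)."""
    abstract = [
        AbstractFrame(f["procedure_id"], f["return_id"],
                      list(f["parameters"]), list(f["local_variables"]))
        for f in concrete_frames
    ]
    abstract.reverse()
    return abstract


def rebuild(abstract_frames):
    """Destination side: push frames oldest first, then report the
    resulting stack newest first again."""
    stack = []
    for frame in abstract_frames:
        stack.append({
            "procedure_id": frame.procedure_id,
            "return_id": frame.return_id,
            "parameters": list(frame.parameters),
            "local_variables": list(frame.local_variables),
        })
    stack.reverse()
    return stack


frames = [
    {"procedure_id": ("Sum", 1), "return_id": ("Sum", 7),
     "parameters": [2], "local_variables": [0]},
    {"procedure_id": ("Main", 0), "return_id": ("Runtime", 3),
     "parameters": [], "local_variables": [10]},
]
assert rebuild(deconstruct(frames)) == frames
```

Storing parameters and locals as ordered lists rather than at fixed offsets mirrors the point that frame offsets differ across platforms.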

Finally,  the  task  descriptor  requires  some  minor  changes 
to  reflect  the  state  of  the  stack  on  the  new  host. 

4.  Performance 

The measurements below were taken on a 233 MHz Pentium II
with 96 MB RAM running Red Hat Linux 5.2 and a Sun
4 SPARC with 32 MB RAM running Solaris 2.5.1. For degrees of
heterogeneity (0) and (1), figures are specified for machines
running Linux as specified above. For degree (2), (LS)
indicates migration from Linux to Solaris and (SL) indicates
migration from Solaris to Linux.

The measurements are for a recursive sum program of
approximately 40 stack frames and a linked list program
with 1000 objects in the heap. Each set of measurements
presents  the  time  taken  for  task  transformation  only  (that  is, 
no  communication  times  are  included)  with  the  measure¬ 
ments  split  between  the  time  taken  by  the  source  and  desti¬ 
nation  hosts  to  perform  the  transformations  necessary. 


Seed Task Migration

Degree   Source (time μs)   Dest. (time μs)

6  6  4  1  8  2(LS)  8  2(SL)

Stackless Task Migration (Linked List)

         Heap                    Stack
Degree   Source     Dest.        Source     Dest.
         (time ms)  (time ms)    (time ms)  (time ms)
0        0          0            0          0
1        0.64       0            0.03       0
2(LS)    4.49       27.65        0.64       0.95
2(SL)    57.20      2.46         7.83       0.05

Full Task Migration (Linked List)

         Heap                    Stack
Degree   Source     Dest.        Source     Dest.
         (time ms)  (time ms)    (time ms)  (time ms)
0        0          0            0          0
1        0.65       0            1.06       0
2(LS)    4.73       30.08        1.21       7.37
2(SL)    59.38      2.47         12.47      0.83

Full Task Migration (Recursive Sum)

         Heap                    Stack
Degree   Source     Dest.        Source     Dest.
         (time ms)  (time ms)    (time ms)  (time ms)
0        0          0            0          0
1        0.03       0            1.14       0
2(LS)    0.03       0.82         1.46       10.70
2(SL)    0.09       0.09         13.95      0.98

The  above  figures  clearly  show  the  advantages  of  iden¬ 
tifying  and  targeting  both  the  different  degrees  of  hetero¬ 
geneity  and  different  classes  of  tasks  to  migrate. 

In  the  context  of  task  creation  and  initial  load  balancing, 
it  is  clear  that  seed  task  migration  holds  the  greatest  advan¬ 
tage  for  all  degrees  of  heterogeneity,  especially  since  seed 
tasks  incur  the  smallest  communication  costs  due  to  their 
size. 

Similarly, the speed-up of degree (1) migration over degree
(2) migration (twice as fast for stack migration, ten
times as fast for heap migration) is considerable. The heap
migration figures reflect the use of pointer tables for pointer
rebasing for degree (1) migration. This suggests a similar
pointer map should be implemented for stack frames.

Of note are the times for the stack and task descriptor
transformations for stackless and full task migration. The
full task recursive sum stack transformation with 40 stack
frames is only 10% to 20% slower than the full task
linked list stack transformation. We believe this is due
to inefficiencies in our current method of loading
meta-information. This is further illustrated by the stack
transformation (really task descriptor transformation) figure
for the degree (2) stackless task migration; full meta-information
for the programmer-defined task properties object
must be loaded, whereas only pointer tables are required
for degree (1) migration.

5.  Related  Work 

There are three main approaches to task migration across
heterogeneous platforms [32]. The first approach assumes
that all tasks will execute on a virtual machine that is
available on all hosts in the system, for example the Java Virtual
Machine. The second and third approaches both assume
that tasks will execute on their host’s native machine. To do
this they need to generate meta-information on the executing
task, so that they can translate the task’s execution state
from one native machine’s format to another. The second
approach relies on code that collects this meta-information
being included in the task’s source code. This can either
be done manually by the programmer or automatically
by a pre-processor. The third approach relies on the compiler
and runtime system to generate the meta-information.

The  first  approach  is  much  simpler  than  the  other  two, 
as  the  use  of  a  common  execution  environment  reduces 
the  problem  to  one  that  can  be  solved  via  a  homoge¬ 
neous  migration  solution.  This  approach  was  initially  used 
by  Chameleon  [12].  Today  it  is  widely  used  by  mo¬ 
bile  agent  systems,  such  as  Agent  TCL  [16],  Aglets  [18], 
ARA  [22],  Concordia  [11],  Extended  Facile  [23],  Liquid 
Software  [13],  Mole  [2],  Obliq  [5],  Odyssey  [15],  Omni¬ 
ware  [20],  Sumatra  [1],  TACOMA  [39]  and  Telescript  [40]. 
Despite  this  approach’s  simplicity  it  suffers  performance 
penalties  from  the  use  of  a  virtual  machine.  Some  solu¬ 
tions  [20,  13]  alleviate  this  problem  by  using  “on-the-fly 
compilation”  to  translate  parts  of  the  task’s  code  to  native 
code.  However,  the  native  code  produced  is  still  25  percent 
slower  than  regular  native  code  [1],  as  they  must  include 
safeguards  to  protect  the  execution  environment  from  being 
corrupted  by  the  native  code. 

The  second  and  third  approaches  provide  better  perfor¬ 
mance  results  than  the  first  approach,  as  they  allow  the  tasks 
to  execute  directly  on  the  native  machine.  The  second  ap¬ 
proach  is  more  portable  than  the  third  approach,  as  it  does 
not  require  a  specialised  compiler.  However,  the  third  ap¬ 
proach  delivers  better  runtime  performance  as  it  does  not 
need  to  generate  all  of  its  meta-information  at  runtime. 
In  addition  to  this,  the  third  approach’s  migration  mech¬ 
anism  is  more  transparent  to  the  programmer.  Examples 
of  the  second  approach  include:  HMF  [21],  Process  In¬ 
trospection  [14],  HiCaM  [25],  Ythreads  [30],  Arachne  [8], 
DOME  [31],  PMT  [32]  and  MpPVM  [6].  Examples  of 
the  third  approach  include:  Emerald  [35],  Tui  [34],  Shub, 
Dubach  and  Rutherford’s  work  [9,  10],  Hollander  and  Sil- 
berman’s  work  [17],  Distributed  C  [24]  and  porch  [36].  Our 
work is based on the third approach, as it provides the best
results.

With  heterogeneous  task  migration,  most  research  has 
focused  on  how  to  reconstruct  the  task’s  state  [38,  3,  7,  30], 
the  location  of  migration  points  [4,  35]  and  analysing  the 
safety  aspects  of  this  approach  [34,  19].  Very  little  research 
has  been  done  on  how  to  optimise  migration  based  on  dif¬ 
ferent  kinds  of  tasks  and  different  degrees  of  heterogeneity. 
Most  of  the  work  in  this  area  has  been  done  by  the  Univer¬ 
sity  of  Colorado  at  Colorado  Springs  [10,  33,  9].  The  most 
significant  contribution  originating  from  their  work  is  the 
idea of ensuring that compilers generate code with the same
data  alignment  rules.  Our  work  builds  on  what  they  have 
done  by  providing  optimised  translation  based  on  the  task 
type  and  the  degree  of  heterogeneity  between  platforms. 

6.  Discussion  and  Further  Work 

A  basic  heterogeneous  task  migration  system  has  been 
implemented  and  initial  results  are  promising.  We  are  cur¬ 
rently  working  on  a  revised  system  using  local  compiler 
back-end  technology  which  supports  the  migration  of  tasks 
utilising  optimised  code.  To  make  the  migration  process 
even  more  efficient  we  are  looking  at  optimising  our  meta¬ 
information.  Current  performance  figures  suggest  that  our 
current  meta-information  and  corresponding  traversal  tech¬ 
niques  may  be  over  complicated.  To  this  end,  a  generalised 
version  of  the  pointer  map  scheme,  using  bitmaps  to  plot 
required  actions  for  stack  frames  and  objects  is  being  con¬ 
sidered.  Other  methods  of  optimising  meta-information  in¬ 
clude  compression  and  lazy  loading. 

A general problem is how to deal with non-migrable resources such as I/O and file handles; our current solution
is to retain remote references to them [26]. An interesting
alternative is to migrate tasks to the JVM, where no target mapping is defined, using e.g. the Java Platform Debugger
Architecture [37]. Task migration may also be generalised
to  encompass  dynamic  software  reconfiguration.  We  have 
yet  to  study  degree  3  heterogeneity  where  e.g.  word  sizes 
and  alignments  of  data  may  differ  between  platforms;  this 
is  particularly  challenging. 
Abstract

Cluster computing is presently a major research area,
mostly for high performance computing. The work herein
presented  refers  to  the  application  of  cluster  computing  in  a 
small  scale  where  a  virtual  machine  is  composed  by  a  small 
number  of  off-the-shelf  personal  computers  connected  by  a 
low  cost  network.  A  methodology  to  determine  the  opti¬ 
mal  number  of  processors  to  be  used  in  a  computation  is 
presented  as  well  as  the  speedup  results  obtained  for  the 
matrix-matrix  multiplication  and  for  the  symmetric  QR  al¬ 
gorithm  for  eigenvector  computation  which  are  significant 
building  blocks  for  applications  in  the  target  image  process¬ 
ing  and  analysis  domain.  The  load  balancing  strategy  is 
also  addressed. 


1.  Introduction 

Several personal computer or workstation based cluster systems have been developed, ranging from commercial off-the-shelf processors to high performance ones such as SMP architectures [3], some using high performance networks like
Myrinet [2, 19]. Most of this work is devoted to high
performance computing, aiming to match the performance
of a specific supercomputer at a lower cost.

Our  aim  is  not  to  build  a  cluster  of  personal  computers 
for  parallel  processing  but  to  do  parallel  processing  on  al¬ 
ready  existing  group  clusters,  where  each  node  is  a  desktop 
computer  running  the  Windows  operating  system.  These 
clusters  are  characterized  by  having  a  low  cost  network, 
such  as  a  10  Mbits/s  Ethernet,  connecting  different  types 
of  processors,  of  variable  processing  capacity  and  amount 
of  memory,  thus  forming  a  heterogeneous  parallel  virtual 
computer. Due to network restrictions, which do not allow
simultaneous communication among several nodes, the application domain is restricted to one or two dozen processors.

The need for a methodology to determine the ideal number of processors also stems from network restrictions:
as the number of processors increases, the network
becomes a communication bottleneck and the time spent in
data exchange can outweigh the benefit of additional processing power. This issue is rarely addressed in the high performance cluster literature, because the problem sizes considered there are usually huge. However, [17] studies a scheduling policy for multiprocessor systems based on the observation that some applications cannot
exploit the available computational power because of hardware
and software constraints. In [4] a performance model for
heterogeneous processing was proposed, but not in the context of processor co-operation to solve a task.

Our  motivation  for  a  parallel  implementation  of  lin¬ 
ear  algebra  algorithms  comes  from  image  and  image  se¬ 
quence  analysis  needs,  posed  by  various  application  do¬ 
mains,  which  are  becoming  increasingly  more  demanding 
in  terms  of  the  detail  and  variety  of  the  expected  analytic  re¬ 
sults,  requiring  the  use  of  more  sophisticated  image  and  ob¬ 
ject  models  (e.g.,  physically-based  deformable  models),  and 
of  more  complex  algorithms,  while  the  timing  constraints 
are  kept  very  stringent. 

A promising approach to deal with the above requirements consists in developing parallel software to be executed, in a distributed manner, by the machines available in
an existing computer network, taking advantage of the well-known fact that many of the computers are often idle for
long periods of time. It is quite common in many organizations that a standard network connects several general purpose workstations and personal computers, accumulating a
very substantial computing power that, through the use of
appropriate managing software, could be put at the service
of the more computationally demanding applications.

Existing  software,  such  as  the  Windows  Parallel  Virtual 
Machine  (WPVM)  [1],  allows  building  parallel  virtual  com¬ 
puters  by  integrating  in  a  common  processing  environment 
a  set  of  distinct  machines  (nodes)  connected  to  the  network. 
Although the parallel virtual computer nodes and the underlying communication network were not designed for optimized parallel operation, very significant performance gains
can be attained if the parallel application software is conceived for that specific environment.

This  paper  addresses  the  problem  [22]  of  determining, 
from  a  pool  of  available  nodes,  which  ones  should  be  se¬ 
lected  for  building  a  parallel  virtual  computer  that  achieves 
the  fastest  application  response  time,  and  it  also  discusses 
the  issue  of  computational  load  distribution;  the  study  con¬ 
siders  that  the  nodes  available  prior  to  running  the  appli¬ 
cation  may  differ  from  time  to  time,  as  different  users  and 
machines  are  active.  At  every  program  initiation  phase,  the 
highest  performance  computers  from  the  available  set  are 
selected,  in  a  number  that  is  computed  for  optimizing  the 
processing  time. 

The  test  cases  presented,  a  parallel  matrix  multiplication 
algorithm  and  the  QR  algorithm,  while  pertinent  to  many 
advanced  image  analysis  methods,  are  also  a  common  mod¬ 
ule  in  many  other  fields,  such  as  in  simulation  problems.  In 
a  previously  reported  work  [5],  the  step  edge  operator  pro¬ 
posed  by  Shen  and  Castan  [20]  was  also  tested. 

2.  Computational  model 


Several computational models [23, 7, 14] have been presented to estimate the processing time of a parallel program. Although they could be adapted to a cluster of personal computers, a specific and simplified model is presented below. The target machine is composed of nodes with different processing capacities, resulting from different amounts of available memory and from various processor types and versions, connected by a standard interconnection network such as Ethernet. Each node of the machine is characterized by its processor capacity S, measured in Mflops. The network is characterized by the number of messages that are allowed simultaneously, by the bandwidth LB, measured in Mbits/s, and by whether or not it supports broadcasting.

The computational model for the virtual machine, describing its behavior for a given algorithm, is obtained by summing the time spent in sequential operations $T_S$ and the time spent in parallel operations $T_P$. Sequential operations include communications, data input/output and other processing that cannot occur in parallel because of the characteristics of the particular algorithm. Parallel operations are those for which the time spent by one processor is divided by p if p processors are used. The total processing time, as a function of the number of processors p and the problem size n, is given by equation 1.

$$T_T(n,p) = T_S(n,p) + T_P(n,p) \qquad (1)$$

The interconnection network is modeled by a temporal expression, $T_C$, representing the time required to transmit a message of $n_b$ bits between two network nodes, assuming a distance 1 network.

$$T_C = T_L + n_b\left(\frac{1}{LB} + T_E\right) \qquad (2)$$

The latency time $T_L$ represents the time gap between the processor order to transmit and the beginning of transmission, and $T_E$ the packing time. The logical topology of an Ethernet provides a single channel, or bus, that carries Ethernet signals to all stations, allowing broadcast communications. There is only one signal channel delivering packets over the network to all stations. Each message is divided into packets of 46 to 1500 bytes of data (packetsize), to be sent sequentially and individually onto the shared channel. For each packet the computer has to gain access to the channel [21]. This division of a message into packets leads to a latency time for each message that is proportional to the number of packets (K) into which it is split, resulting in equation 3.

$$T_{comm} = K\,T_L + n_b\left(\frac{1}{LB} + T_E\right) \qquad (3)$$

The value of K is given by equation 4. A typical value for packetsize is 1024 bytes.

$$K = \frac{n_b/8}{packetsize} \qquad (4)$$


For a heterogeneous virtual machine $T_L$ and $T_E$ depend
on the processor speed S. Several experiments were conducted to measure these parameters for the network referred to in the results section, which is composed
of the processors listed in table 1. The values were measured for the matrix multiplication algorithm over different
matrix sizes, resulting in the average values of table 1.


S (Mflops)       244     161     60      50      49
T_L (μs/byte)    70      130     180     180     180
T_E (μs/byte)    0.05    0.07    0.13    0.13    0.13

Table 1. Processor parameters


Although Ethernet physically allows broadcasting,
WPVM converts a broadcast in a p processor machine into
p - 1 messages [1]. Therefore, to model a broadcast correctly, the time spent in one message has to be multiplied by
p - 1.
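
The communication model of equations 3 and 4, together with the conversion of a broadcast into p - 1 messages, can be sketched as follows. This is only an illustration under stated assumptions: the function and parameter names are hypothetical, all quantities are expressed per byte and in seconds, the latency is charged once per packet, and the packet count is rounded up to a whole packet (the text gives K as a plain quotient).

```python
import math


def message_time(n_bytes, t_latency, t_pack, t_wire, packet_size=1024):
    """Estimated time to send one message (same time unit as the inputs).

    n_bytes     : message length in bytes
    t_latency   : latency charged once per packet (T_L)
    t_pack      : packing time per byte (T_E)
    t_wire      : transmission time per byte (1/LB)
    packet_size : payload bytes per packet
    """
    # Assumption: K is rounded up to a whole number of packets.
    k = math.ceil(n_bytes / packet_size)
    return k * t_latency + n_bytes * (t_wire + t_pack)


def broadcast_time(n_bytes, p, **params):
    """A broadcast modeled as p - 1 sequential point-to-point messages."""
    return (p - 1) * message_time(n_bytes, **params)
```

For instance, `message_time(2048, t_latency=70e-6, t_pack=0.05e-6, t_wire=0.8e-6)` models a two-packet message on a 10 Mbits/s link (0.8 microseconds per byte); the numeric parameter values are illustrative only.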

Independent communications over rows or columns, either for 1-D or 2-D grids, can cause network collisions.
Examples are all slave processes trying to send results to
the master process at the same time or, for the matrix
multiplication algorithm, the distribution of the matrices in each step,
which is independent over rows for one matrix and
over columns for the other. To avoid collisions, a
set of communication routines using a ring communication
pattern, as shown in figure 1, was developed. They allow
processes to synchronize by establishing the order of communications according to each process's position on the grid.


Figure 1. Communication pattern for two independent row broadcasts


The parallel component $T_P$ of the computational model,
equation 5, represents the operations that can be divided
over a set of p processors obtaining a speedup of p, i.e. operations without any sequential part.

$$T_P(n,p) = \frac{\psi(n)}{\sum_{i=1}^{p} S_i} \qquad (5)$$

The numerator $\psi(n)$ is the cost function of the algorithm,
measured in floating point operations (flops), as a function
of the problem size n. For example, to multiply square matrices of size n, the cost is $\psi(n) = 2n^3$ [8]. This operation count does not include memory operations, so the
actual cost is higher. To obtain a correct operation
count one should consider the memory references made and
have an estimate of the memory access time. The nodes of
the virtual machine have different levels of memory (cache,
main memory, and disk) with different access times, and
one cannot predict how many accesses are made to each
one.

Figure 2 shows the processing capacity achieved by a
161 Mflop peak performance processor for the matrix multiplication algorithm. The computational cost is $\psi(n) \approx 21.5\,n^3$ flops. Figure 2 also shows that a non block oriented
algorithm cannot assure a constant coefficient of $\psi(n)$,
which is a requirement in order to be able to estimate the
time the processors will take to execute the algorithm. From
this point on the coefficient of $\psi(n)$ will be referred to as the
algorithm constant $\beta$. The value $\beta$ does not depend on the
processor but rather is a characteristic of the algorithm.

Figure 2. Performance of the matrix multiplication algorithm on a 161 Mflop peak performance processor as a function of matrix size

The denominator of equation 5 is the processing capacity
used, which is obtained by summing the individual processing capacities of the machines. For this equation to be valid
each machine should not take more than $T_P$ seconds to process its part. This assumes perfect load balancing in the
heterogeneous machine.

3. Load balancing strategy

In this section a static load distribution algorithm is presented and issues related to the optimization of processing
time in a heterogeneous environment are discussed.

3.1.  Data  distribution 

To prevent the slowest processors from determining the parallel
processing time, the load should be distributed proportionally to the capacity of each processor. The aim is to assign
the same amount of processing time, which may not correspond to the same amount of data.

The  matrices  are  organized  in  square  blocks  of  data 
which  are  assigned  to  the  processor  grid.  To  achieve  a  bal¬ 
anced  distribution  in  the  heterogeneous  machine  the  number 
of  blocks  assigned  to  each  processor  should  be  proportional 
to  its  processing  capacity  compared  to  the  entire  machine: 


$$l_i = \frac{S_i}{\sum_{k=1}^{p} S_k} \qquad (6)$$

The load index $l_i$, although theoretically correct, is not fully
applicable in practice since the number of blocks assigned
has to be an integer value. As an example, for a machine
composed of 6 processors of capacity {244, 244, 161, 161,
60, 50} Mflops, $l_i$ would be {0.265, 0.265, 0.175, 0.175,
0.065, 0.054}. To distribute a matrix of size 1800 over a
(1,6) processor grid the assignment would be 1800 rows by
{477, 477, 315, 315, 118, 98} columns respectively.

The strategy implemented is to compute the number of
blocks to assign to each processor, rounding the real value
obtained down to the nearest integer, so that some blocks
are left to be assigned. Then, to obtain an optimal solution,
the remaining blocks are assigned one at a time to the grid
of processors, each time choosing the processor that will take the least time to
finish the job.

For the test case presented, consider a block size of 25 elements, which leads to an assignment, in terms of number of
blocks, of {19, 19, 12, 12, 4, 3}, summing to 69 out of a total of 72
blocks and leaving 3 blocks unassigned. Using the time complexity analysis presented in the next section for the matrix
multiplication algorithm, $T_P = 21.5\,n^3/S$, the estimated
computational time per processor is {135.6, 135.6, 129.8,
129.8, 116.1, 104.5} seconds. Each block will take {7.14,
7.14, 10.82, 10.82, 29.03, 34.83} seconds on each processor respectively. Each remaining block is assigned to the
processor that would finish first once this block processing time is added to
its total time. The first block is assigned
to processor 6 and the last two blocks to processors 3 and
4, resulting in an estimated processing time of {135.6, 135.6,
140.6, 140.6, 116.1, 139.3} seconds. Perfect load balancing cannot be achieved; however, for this block size it is
the optimal assignment, i.e. the assignment that leads to the
minimum processing time. Figure 9 shows the processing
time measured for each processor.
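
The two-step assignment described above (round down, then hand out the leftover blocks greedily) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name and argument layout are hypothetical, capacities are in Mflops, and the work per block is in Mflop so that the quotient is a time in seconds.

```python
import math


def distribute_blocks(capacities, n_blocks, work_per_block):
    """Assign n_blocks to processors in proportion to their capacity.

    capacities     : processing capacity S_i of each processor (Mflops)
    n_blocks       : total number of blocks to assign
    work_per_block : cost of one block (Mflop)
    Returns the number of blocks given to each processor.
    """
    total = sum(capacities)
    load_index = [s / total for s in capacities]           # l_i
    blocks = [math.floor(l * n_blocks) for l in load_index]
    block_time = [work_per_block / s for s in capacities]  # seconds per block

    # Hand out the leftover blocks one at a time to the processor that
    # would finish earliest after receiving one more block.
    for _ in range(n_blocks - sum(blocks)):
        finish = [(b + 1) * t for b, t in zip(blocks, block_time)]
        blocks[finish.index(min(finish))] += 1
    return blocks


# Test case from the text: n = 1800, 72 blocks, psi(n) = 21.5 n^3 flops.
work = 21.5 * 1800**3 / 72 / 1e6
print(distribute_blocks([244, 244, 161, 161, 60, 50], 72, work))
```

For the test case in the text the sketch starts from {19, 19, 12, 12, 4, 3} and gives the three leftover blocks to processors 6, 3 and 4, in that order.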

Another issue in data distribution for a heterogeneous
machine is keeping the load balanced throughout the whole algorithm. For some algorithms, such as tridiagonal reduction
and LU factorization, in each iteration part of the matrix is
fully computed and not visited again, so the working matrix
becomes smaller from step to step. This can lead to a load imbalance if the distribution is not cyclic. For the example
above, if contiguous blocks are assigned to each processor,
one of the fastest processors would be idle after computing
19 blocks of the matrix, with 53 blocks remaining to be processed.

To overcome this load imbalance, blocks are organized
in balanced groups. With $l_i$ the load index of processor i,
one defines the group block $G_B$ as:

$$G_B = \frac{1}{\min(l_i)} \qquad (7)$$

If $G_B/Q < 2$ then $G_B = 2/\min(l_i)$, where Q is the number of column processors. For a (P, Q) grid the algorithm
is applied to columns and rows independently, considering
the processing capacity by column and row respectively, as
shown in table 2.

For the example given above $G_B = 1/0.054 = 18$
blocks, giving a group block of {5, 5, 3, 3, 1, 1}.
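
The group block computation can be illustrated with the short sketch below. It rests on two assumptions that the text does not state explicitly: the group size is truncated to an integer, and each processor's share of a group is obtained by rounding its proportional share to the nearest integer (which need not sum exactly to the group size in general). The function name is hypothetical.

```python
def group_block(capacities, q):
    """Return (group size G_B, blocks per processor inside one group).

    capacities : processing capacity S_i of each processor
    q          : number of column processors Q
    """
    total = sum(capacities)
    load_index = [s / total for s in capacities]   # l_i
    size = int(1 / min(load_index))                # assumption: truncate
    if size / q < 2:
        size = int(2 / min(load_index))
    # Assumption: proportional share rounded to the nearest integer.
    share = [round(l * size) for l in load_index]
    return size, share


print(group_block([244, 244, 161, 161, 60, 50], q=6))
```

Under these assumptions the example machine yields a group of 18 blocks split as {5, 5, 3, 3, 1, 1}, in agreement with the values quoted in the text.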

With this strategy it is guaranteed that from the beginning to the end of the algorithm all processors are involved
in proportion to their load indices $l_i$, allowing effective
load balancing. When the last group block is being processed, the last 8 blocks would be computed by the slowest
processors; it is reasonable that, in cases where some processors cannot participate due to the lack of data, it should
be the fastest ones doing the computation. Therefore, the
cyclic distribution is used inside each group block.

3.2.  Data  redistribution 

In  order  to  exploit  the  computational  capacity  of  the  tar¬ 
get  machine,  the  algorithms  must  be  implemented  in  order 
to  increase  the  computation  to  communication  ratio,  mainly 
due  to  the  slow  network.  Therefore,  data  redistribution  is 
allowed  in  order  to  switch  to  the  optimal  grid  computed  for 
each  algorithm.  Data  distribution  is  represented  by  system 
independent  objects,  allowing  the  system  to  switch  between 
two  unrelated  processor  grids. 

The cost of redistribution is estimated by the communication of $n^2$ elements for a matrix of size n, which is the
worst case, i.e. every element being allocated to a different processor. The redistribution algorithm proceeds from the
first processor (1,1) to the last, exchanging data synchronously
with the remaining processors.

For  related  grids,  e.g.  switching  from  (1,6)  to  (1,7),  the 
system  evaluates  if  the  gain  in  time  due  to  the  addition  of 
one  processor  is  overcome  by  the  data  redistribution  time. 
In  that  case  the  grid  change  does  not  occur. 

3.3.  Block  size 

The block size should be chosen according to the following conditions: first, it should maximize the individual
processing capacity, which as shown in figure 2 degrades for
a block size of 1, and second, it should allow the implementation of a
load balancing distribution. For the machines tested, a block
size in the range 15 to 40 ensures an almost constant processing capacity.

For  a  sequence  of  parallel  algorithms,  e.g.  for  eigenvec¬ 
tor  computation  where  different  grids  are  used,  the  block 
size  should  satisfy  all  grids  in  terms  of  load  balancing  since, 
although  there  is  data  redistribution,  this  parameter  remains 
unchanged. 

3.4.  Processor  selection  policy 

The  system  keeps  a  record  of  the  computers  enrolled  in 
the  parallel  virtual  machine  ordered  by  decreasing  compu¬ 
tational  capacity.  If  only  part  of  the  machine  is  needed  to 
execute  the  algorithm  the  computers  are  selected  from  the 
fastest  to  the  slowest  one. 

A  computer  is  considered  available  for  parallel  process¬ 
ing  if  there  is  no  user  activity  for  at  least  half  the  process¬ 
ing  time  of  the  last  parallel  algorithm.  If  a  user  starts  us¬ 
ing  his/her  computer  during  a  parallel  execution,  the  system 
does  not  transfer  the  work  to  another  computer;  it  completes 
the  current  job  and  then  marks  the  computer  as  unavailable. 
For  the  problem  size  addressed,  whose  processing  time  is 
expected  to  be  of  a  few  minutes,  this  policy  is  satisfactory. 
4. Optimization of the processing time

A parallel algorithm may have two aims: to obtain better accuracy of results by using a more detailed domain, which would not be possible on a single processor, usually due to memory limitations, or, for a given accuracy, to obtain a reduction in the processing time. The time gain obtained with the parallel algorithm is usually called Speedup and is defined as the quotient of the serial algorithm time ($T_1$) over the parallel algorithm time ($T_T$).

$$Speedup = \frac{T_1}{T_T} \qquad (8)$$

Depending on how the serial processing time is measured one can have different definitions of Speedup. Relative Speedup is obtained if the serial time is the processing time of the parallel algorithm on a single node of the parallel computer. Real Speedup is obtained if the serial time is the processing time of the most efficient sequential algorithm on a single node of the parallel computer. Absolute Speedup is defined when the serial time is obtained for the fastest sequential algorithm executed on the fastest sequential computer available [18]. In the context of the envisaged applications of the parallel virtual machine, we define Speedup as the ratio between the processing time of the serial version on the computer that controls the parallel execution (master) and the processing time of the parallel program. This is the effective gain as seen by the user, who has a choice between his/her own single machine (master) and the parallel virtual machine; the definition is also globally fair when the master computer is one of the fastest available, which is the case in the test cases presented below.

In a parallel virtual machine it is quite common that each node of the computer network is not fully available to the user who is running a parallel application. The application should not schedule work for nodes that are in use by other users, and therefore it should keep a record of the ones that are free. The aim in scheduling work for distributed processing is to obtain the smallest processing time achievable on that particular network, even if some of the nodes are left idle. Therefore, the relevant parameter to be considered is the Speedup rather than the Efficiency, which is often used in other contexts.

Given the above definitions, one can state the goal of the work herein reported as the determination of the optimum number of processors using a criterion of minimum processing time. The optimal number of processors p, which minimizes $T_T(n,p)$, is the one for which an increase in the serial component, due to the addition of one more processor, is balanced by the gain obtained in the processing time of the parallel component.

4.1. Application to a homogeneous machine

For a given algorithm, characterized by the constant $\beta$, and for size n matrices, p can be obtained by solving equation 9 for p [5].

(9)

For a homogeneous machine equation 5 simplifies to $T_P(n,p) = \psi(n)/(pS)$ and the communication parameters $T_L$ and $T_E$ assume the same value for all machines, allowing a straightforward solution.

4.2. Application to a heterogeneous machine


For a heterogeneous machine another degree of complexity is added to equation 9: first, processors have different computational capacities (S) and second, the communication parameters $T_L$ and $T_E$ also vary with S, as shown in
table 1.

To tackle this problem one first orders the nodes by decreasing value of $S_i$ (the capacity of node i), and then schedules the work from the fastest to the slowest free node, resulting in the denominator of equation 5: $S_T(p) = \sum_{i=1}^{p} S_i$.
To compute the first derivative of $T_T$ with respect to p it is required to find the sum $S_T(p)$, which cannot be computed
beforehand since one does not know how many processors
will be used. The function $S_T(p)$ increases monotonically
with p, with a growth rate that decreases with increasing
p, as shown in figure 3 for a machine composed of processors of capacities {244, 244, 161, 161, 60, 50, 49} Mflops
in decreasing order.


[Figure 3 plot; horizontal axis: number of processors, 1 to 7]

Figure 3. Processing capacity of the heterogeneous machine as a function of the processors used

The aim is to approximate $S_T(p)$ by a polynomial function in p in order to be able to solve equation 9. A first
order polynomial function, as used for a homogeneous machine, is not adequate here. The ideal polynomial function
would be one that passes through all points of $S_T(p)$; however, its
computation time may be significant for a large number of
processors. The solution adopted was an iterative quadratic
approximation. The first function is defined by zero and the
extreme points of $S_T(p)$. The iterative process allows the
reevaluation of the cost function $T_T(n,p)$ in the neighborhood of the solution computed. In each iteration only half of
the processors used in the last iteration are considered, the polynomial function being defined as follows: if P is the total number
of processors and $p^{i-1}$ the solution for iteration (i - 1), then
in iteration i the function is defined by the three points of
equation 10.

$$S_T\left(p^{i-1} \pm P/2^{i+1}\right) \quad \text{and} \quad S_T\left(p^{i-1}\right), \qquad i = 1, 2, \ldots \qquad (10)$$

The second degree polynomial function has the same behavior as $S_T(p)$ and is written as:

$$P_S(p) = ap^2 + bp + d \qquad (11)$$

resulting in the first derivative of $T_P(n,p)$ with respect to p:

$$\frac{\partial T_P}{\partial p} = \frac{\partial}{\partial p}\left(\frac{\psi(n)}{ap^2 + bp + d}\right) = 0 \qquad (12)$$

which must be solved in order to obtain the number of processors p that minimizes the total processing time.

If the logical grid of processors affects the processing
time, then changing to a 2D grid (e.g. an (r, c) grid) or a 3D grid
(e.g. a hypercube) adds one or two dimensions to the
problem, respectively. For the 2D grid the quadratic approximation with p = rc becomes:

$$P_S(r,c) = a(rc)^2 + b(rc) + d \qquad (13)$$

The communication parameters $T_L$ and $T_E$ also need
to be modeled by a function of p in order to solve
$\partial T_S(n,p)/\partial p$. To transmit a message from computer A to
B the latency and packing time depend on the speed of processor A. If one can predict the amount of data each processor will be responsible for transmitting, one can estimate the
time spent in communications by the whole machine. According to the data distribution algorithm, each processor
is allocated an amount of data proportional to its relative
speed in the heterogeneous machine: $l_i = S_i / \sum_{k=1}^{p} S_k$.
Therefore, functions to model these parameters are defined
by equations 14 and 15, corresponding to a weighted mean
of these values for each possible number p of processors. The values of $(T_L)_i$ and $(T_E)_i$ are shown in table 1.

$$T_{TL}(p) = \sum_{i=1}^{p} l_i \times (T_L)_i \qquad (14)$$

$$T_{TE}(p) = \sum_{i=1}^{p} l_i \times (T_E)_i \qquad (15)$$

For the machine considered (figure 3), the functions
$T_{TL}(p)$ and $T_{TE}(p)$ are shown in figures 4 and 5 respectively. Those figures also show a first degree polynomial approximation to be included in $T_S(n, r, c)$.

Figure 4. Approximation for $T_{TL}(p)$ per byte

Figure 5. Approximation for $T_{TE}(p)$ per byte

The (r, c) configuration that minimizes the processing
time is obtained from $\nabla T_T(n,r,c) = 0$. Since one wants to
compute the ideal grid (r, c) for a given problem size n, the
first derivative of $T_T(n,r,c)$ with respect to n is zero. Thus, the
optimal configuration is obtained by solving the system of
equations 16.

$$\begin{cases} \dfrac{\partial T_T(n,r,c)}{\partial r} = 0 \\[2ex] \dfrac{\partial T_T(n,r,c)}{\partial c} = 0 \end{cases} \qquad (16)$$
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4.3.  Applying  the  methodology  to  a  matrix  multi¬ 
plication  algorithm 


+/3 


Ps(r ,  c) 


(20) 


The  methodology  presented  above  will  be  tested  with  an 
improved  implementation  of  the  matrix  multiplication  oper¬ 
ations  [11].  Figure  6  shows  an  hypothetical  data  assignment 
for  a  (2,3)  processor  grid.  For  simplicity,  the  blocks  dis¬ 
played  are  formed  by  contiguous  data,  although  the  block 
cyclic  data  distribution  is  used  [10]. 

To  compute  the  matrix  product  C  =  Ax  B,  in  each  iter¬ 
ation  of  the  algorithm  each  processor  multiplies  one  column 
block  of  A  by  one  row  block  of  B ,  updating  the  correspon¬ 
dent  block  of  C.  The  shadowed  area  in  matrix  C  represents 
the  block  that  processor  (0, 0)  has  to  update  in  each  itera¬ 
tion. 


♦ 

Figure  6.  Matrix  multiplication  operations 

Considering  a  grid  (r,  c)  of  processors,  the  matrices  A  = 
(m,k)>  B  =  (kj)  and  C  —  (ra,Z)  the  amount  of  data 
required  to  broadcast  matrix  A  over  the  rows  of  processors 
is: 


mk , 

—  —  (c  -  l)rc  =  mk(c  —  1)  (17) 

Note  that  (c—  1)  appears  because  the  broadcast  is  in  fact  per¬ 
formed  by  sequential  communications.  To  broadcast  matrix 
B  over  the  column  of  processors  it  is  required  to  transmit: 

k  l 

— -(r  —  l)cr  =  kl(r  -  1)  (18) 

The  time  required  to  compute  the  inner  loop  products  is 
given  by: 


TP 


a 


(19) 


where  Sr(r,c)  is  the  processing  capacity  of  the  heteroge¬ 
neous  machine  when  rc  processors  are  used.  The  value  /? 
for  the  matrix  multiplication  is  21.5,  as  given  in  section  2. 
The  total  estimated  processing  time,  assuming  square  ma¬ 
trices  of  size  n,  is  expressed  as: 


TT(n,r,  c) 


,n2(r  +  c  —  2) 
packetsize 


)Ttl{t ,  c) 


+(n2(r  +  c  —  2  )){LB  1  +  Tte (r,  c)) 


Depending on the data type used (float or double), the corresponding communication factors have to represent the amount of data in bytes. LB is the network bandwidth, so that $LB^{-1}$ is the time needed to transmit one byte.

For the machine of figure 3 the quadratic approximation, equation 11, becomes $P_S(r,c) = -17.595(rc)^2 + 261.6rc$. This approximation is close to the real curve $S_T(r,c)$ for values $0 < rc \le 7$. Outside this domain the polynomial function may introduce false minima in the processing time function. Therefore, the minimization must be restricted to the domain allowed by the number of processors available. This can be accomplished by introducing Lagrange multipliers [15] in the system of equations 16. An additional function to restrict the domain is included:

$$\begin{cases} \dfrac{\partial T_S(n,r,c)}{\partial r} + \dfrac{\partial T_P(n,r,c)}{\partial r} = -\lambda c \\[2mm] \dfrac{\partial T_S(n,r,c)}{\partial c} + \dfrac{\partial T_P(n,r,c)}{\partial c} = -\lambda r \\[2mm] \lambda (rc - 7) = 0 \end{cases} \quad (21)$$

Figures 7 and 8 present results for matrices of size 1800. Figure 7 displays the estimated (Est.) and measured (Meas.) communication time for one and two rows of processors, limited to 7 processors. Figure 8 displays the total processing time $T_T(n,r,c)$ obtained by estimation with the quadratic approximation for machine processing capacity (Tot. E), by estimation using the exact processing capacity (Tot. R), and the measured time (Tot. M). The communication times are modeled correctly, with only a slight difference for some grids. The total estimated processing time differs from the measured one because the quadratic approximation underestimates the processing capacity in some cases and overestimates it in others, although the behavior is similar to the measured curve and it does not introduce false minima in the processing time function. The curve obtained with the real processing capacity of the heterogeneous machine shows that the overall model is correct and that the processing time can be accurately estimated.
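
The under- and overestimation mentioned above can be made concrete with a short Python sketch that compares the quadratic approximation with the sum of the individual processor capacities. The list of capacities is an assumption: six values are taken from table 2 and the seventh (49 Mflops) from the range quoted in the conclusions, and processors are assumed to be added fastest first. The function names are hypothetical.

```python
# Assumed processor capacities in Mflops, fastest first.
SPEEDS = [244, 244, 161, 161, 60, 50, 49]


def real_capacity(p):
    """Capacity of the machine formed by the p fastest processors."""
    return sum(SPEEDS[:p])


def quad_capacity(p):
    """Quadratic approximation P_S quoted in the text, with p = r*c."""
    return -17.595 * p**2 + 261.6 * p


# Positive error: capacity overestimated; negative: underestimated.
errors = {p: quad_capacity(p) - real_capacity(p) for p in range(1, 8)}
worst = max(errors, key=lambda p: abs(errors[p]))

for p, err in errors.items():
    print(f"p={p}: real={real_capacity(p)}, quad={quad_capacity(p):.1f}, error={err:+.1f}")
print("largest absolute error at p =", worst)
```

Under these assumptions the approximation is almost exact for 1 and 7 processors and deviates most for 4 processors.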

Figure 7. Communications for the matrix multiplication algorithm (matrix size 1800); horizontal axis: columns of processors (c)

Table 2. Processor layout for grid (2,3)

                 Column 1   Column 2   Column 3   Row total
  Row 1          244        161        60         465
  Row 2          244        161        50         455
  Column total   488        322        110

Figure 8. Processing time for the matrix multiplication algorithm (matrix size 1800)

Figure 9. Matrix multiplication processing time

Solving the system of equations 21, the values r = c = 2.65 are obtained for n = 1800 and LB = 100 Mbits/s. Since one wants an integer solution, it can be assumed that c = 3, which implies r = 2, since rc ≤ 7. The grid (3,2) would be equivalent. Figure 8 shows that the minimum is obtained for grid (2,3), confirming the system solution, although there is an increase in the processing time compared to the estimation. This is the consequence of an unbalanced grid, which cannot be overcome for that machine. Table 2 shows the processor layout for grid (2,3). The first two columns of processors are balanced, which does not happen for column 3, in which either processor (1,3) will be underloaded or processor (2,3) will be overloaded, delaying all other processors as they will always be waiting to communicate.
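
Because the number of integer grids with rc ≤ 7 is small, the integer solution can also be cross-checked by evaluating equation 20 on every admissible grid. The Python sketch below does this by exhaustive search, which is a substitute for, not an implementation of, the Lagrange multiplier solution. It assumes constant $T_{TL}$ and $T_{TE}$ (in the text they are functions of the processors used), capacities expressed in Mflops, 8-byte elements, and made-up communication parameters; all function and parameter names are hypothetical.

```python
BETA_MATMUL = 21.5  # beta for matrix multiplication, as quoted in the text


def quad_capacity(p):
    """Quadratic approximation P_S of the machine capacity, p = r*c."""
    return -17.595 * p**2 + 261.6 * p


def total_time(n, r, c, t_tl, t_te, inv_bw, packet_size, elem_bytes=8):
    """Equation 20 with constant T_TL and T_TE (a simplifying assumption)."""
    volume = elem_bytes * n**2 * (r + c - 2)            # bytes to transmit
    comm = (volume / packet_size) * t_tl + volume * (inv_bw + t_te)
    comp = BETA_MATMUL * n**3 / (quad_capacity(r * c) * 1e6)  # assumes Mflops
    return comm + comp


def best_grid(n, max_procs=7, **comm):
    """Return the (r, c) grid with r*c <= max_procs that minimizes equation 20."""
    grids = [(r, c)
             for r in range(1, max_procs + 1)
             for c in range(1, max_procs + 1)
             if r * c <= max_procs]
    return min(grids, key=lambda g: total_time(n, g[0], g[1], **comm))


# Made-up communication parameters: 8e-8 s/byte corresponds to 100 Mbits/s.
params = dict(t_tl=1e-4, t_te=1e-8, inv_bw=8e-8, packet_size=1500)
print(best_grid(1800, **params))
```

For a fixed product rc the computation term is the same, so any positive communication cost favors the grid with the smaller r + c; this is the effect that makes grid (2,3) preferable to grid (1,6).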

Figure 9 shows the processing time for all processors, where it can be seen that processor 6 is delaying the process for grid (2,3). Grid (1,6) is better balanced, but the ideal load balance is not achieved due to the indivisibility of the data blocks. For this network, due to the ratios between processor speeds, a balanced load can only be achieved with small blocks of data. The square block size used was 25. A smaller block size, e.g. 10, while improving the load balance, would decrease the individual performance of the processors due to under-utilization of the processor cache memory.

Note that although grid (2,3) is less balanced and there is one processor that takes more time, it is a better solution than grid (1,6) because the latter requires more communication time, as can be seen in figure 7.

5.  Results 

This section presents results for the tridiagonal reduction (TRD), LU and QR factorization algorithms on the heterogeneous machine represented in figure 3, using an Ethernet network at 100 Mbits/s. Figure 10 shows the performance of each algorithm on a single processor. The QR performance is divided by 2 for display purposes. As shown before for the matrix multiplication algorithm, the processor performance is kept almost constant for the block versions of these algorithms, for matrices of size greater than 400. The corresponding β value considered for each algorithm is the average in that domain. The square block size used in all cases varies from 15 to 40. There is some variation in the processor performance for a given matrix size, mainly due to the operating system (Windows NT), which sporadically has some activity of its own; however, this represents a variation in the processing time below 1%.

The estimated values presented below are obtained by applying the system of equations 21 using the time function of each algorithm.

5.1.  LU  factorization  algorithm 

The LU factorization algorithm is applied in order to solve a system of equations directly. The implementation is the right-looking variant; algorithm details can be found in [9]. For an (r, c) grid of processors, the amount of data (double/float) transmitted in the parallel matrix update is:

$$\frac{n^2}{2}(r + c - 2) \quad (22)$$

and the parallel processing time is:

$$T_P(n,r,c) = \beta \frac{n^3}{S_T(r,c)} + O(n^2) \quad (23)$$

The β for LU is 7.5. There is a component of complexity $n^2$ corresponding to the computations made by the pivot processor. Figure 11 shows the estimated and measured processing time for a matrix of size 1800. Although it is hardly perceptible in the figure, the optimum value estimated for (r, c) is (1,5). In practice the optimum is grid (1,4), which outperforms grid (1,5) by only 0.5 seconds. In this case the difference is due to the quadratic approximation for machine processing capacity. If the real values are used the estimated optimum is (1,4).

Figure 10. Performance of LU, QR and TRD block algorithms on a 161 Mflop peak performance processor as a function of matrix size

Figure 11. LU processing time for a matrix of size 1800

Figure 12 shows the estimated (E) and measured (M) communication times for matrices of size 1200 and 1800. In general the communications are well modeled. The differences observed are less than 3 seconds. This can lead to a grid selection that is not the optimal one; however, since the processing times obtained for grids (1,4), (1,5) and (1,6) are 69.1, 69.6 and 70.0 seconds, the main drawback would be to have unnecessary processors allocated.

Figure 12. Estimated (E) and measured (M) communications for LU algorithm

Figure 13 shows the load distribution for the matrix of size 1800. For up to 5 processors a good load balancing is achieved, with processors taking almost the same time to process the data allocated to them. The block size from processor (1,1) to (1,5) is 1800 rows by 500, 500, 340, 340 and 120 columns respectively. Ideally they should receive 504, 504, 333, 333 and 124 columns.
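
The ideal column counts quoted above follow from allocating columns in proportion to processor speed. The Python sketch below reproduces them, assuming that processors (1,1) to (1,5) have capacities of 244, 244, 161, 161 and 60 Mflops (the five largest values of table 2). The rounding to whole blocks of 20 columns is only an illustrative assumption; the block-cyclic allocation actually used may follow a different rule, and simple rounding is not guaranteed to add up to the matrix size in general. The function names are hypothetical.

```python
def ideal_columns(n_cols, speeds):
    """Columns per processor, proportional to processor speed."""
    total = sum(speeds)
    return [n_cols * s / total for s in speeds]


def nearest_block_columns(n_cols, speeds, block):
    """Round each ideal share to the nearest whole number of blocks."""
    return [block * round(x / block) for x in ideal_columns(n_cols, speeds)]


speeds = [244, 244, 161, 161, 60]     # assumed capacities in Mflops
ideal = ideal_columns(1800, speeds)
print([int(x) for x in ideal])                   # [504, 504, 333, 333, 124]
print(nearest_block_columns(1800, speeds, 20))   # [500, 500, 340, 340, 120]
```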

5.2.  Tridiagonal  reduction  algorithm 

The tridiagonal reduction algorithm (TRD) is a step in the computation of the eigenvalues and eigenvectors of a symmetric matrix. Details of the algorithm can be found in [6]. For an (r, c) grid of processors, the amount of data to transmit is:

$$2n(r - 1) + 4n^2(rc - 1) \quad (24)$$

for computation and broadcast of Householder vectors, parallel matrix update and matrix vector products. The parallel processing time is:

$$T_P(n,r,c) = \beta \frac{n^3}{S_T(r,c)} + O(n^2) \quad (25)$$

The β for TRD is 28 and there is also a negligible term in $n^2$. Figure 14 shows the processing time for a matrix of size 1200. For grids (1,1) to (1,4) the estimated time is higher than the measured one; the maximum error occurs for grid (1,4), which coincides with the maximum error in the quadratic approximation of computational capacity. The minimum is correctly determined as grid (1,4). Again, if grid (1,3) were chosen the total time would be only marginally higher: 104.7 s instead of 100.0 s. To guarantee the selection of the best grid, the scheduler can operate with real values of processing capacity for estimating the processing time in the neighborhood of the solution obtained by the system of equations 21.

Figure 13. LU load distribution for a matrix of size 1800


Figure  14.  Tridiagonal  reduction  processing 
time  for  a  matrix  of  size  1200 

Figure 15 shows the estimated (E) and measured (M) communication times for matrices of size 800 and 1200. The most significant differences are for the matrix of size 1200, where communications are overestimated. In all cases the difference is below 1.1 seconds.

Figure 15. Estimated (E) and measured (M) communications for Tridiagonal reduction algorithm (horizontal axis: columns of processors)

Figure 16 shows the load distribution for the matrix of size 1200. For grid (1,4) a good load balancing is achieved. For grid (1,5) one process takes 3 seconds less than the others because it was assigned one block fewer, of size 20. The data allocated to each processor was 1200 rows by 340, 320, 220, 220 and 100 columns respectively; ideally it should be 1200 by 336, 336, 222, 222 and 83. Grids (1,6) and (1,7) are also not well balanced, due to block indivisibility.


Figure 16. Tridiagonal reduction load distribution for a matrix of size 1200


5.3.  QR  iteration  algorithm 

The  QR  iteration  is  the  last  step  in  the  eigenvector  com¬ 
putation  sequence,  preceded  by  the  tridiagonal  reduction  of 
a  symmetric  matrix  and  orthogonal  matrix  computation. 

In short, the procedure is to compute Givens rotations in order to reduce the tridiagonal matrix to a diagonal one whose elements are the eigenvalues. Eigenvectors are computed by updating the orthogonal matrix, resulting from the tridiagonal reduction, with the rotations. Each rotation affects only two columns of the orthogonal matrix; a detailed explanation is given in [12].

The parallelization implemented takes advantage of the fact that one rotation updates only two columns without inter-row dependencies. For the tridiagonal reduction a column oriented distribution is more favorable; however, for the QR iteration that data allocation implies communications between boundary columns, with the additional drawback that cyclic distributions increase the number of boundary columns drastically. A column oriented algorithm applying the technique of multiple bulges [13] was implemented, but only a marginal speedup, below 1.5, was obtained, because multiple bulges increase the number of iterations required, which, together with the boundary communications, is not suited to the slow bus network of the target machine.

Alternatively, the data can be redistributed in order to match the ideal processor grid for each algorithm. In this case, a row oriented strategy was used for the QR iteration.

The QR iteration has two computational tasks: one, to do the bulge chase, of order $n^2$, and the other, to update the orthogonal matrix, of order $n^3$:

$$T_P = \beta \frac{n^3}{S_T(r,c)} + \Theta(n^2) \quad (26)$$

The β for QR is 43. The time to compute the chases is in fact negligible compared to the $\Theta(n^3)$ term (e.g., for the matrix of size 1600 used it takes 2.1 seconds to compute the chases and 721 seconds to update the matrix on a 244 Mflop computer). Therefore, the solution adopted was to do the chases on one computer, (1,1), the fastest one, which at the end of a chase transmits the corresponding rotations to the remaining processors. Then, all processors update the part of the orthogonal matrix allocated to them without requiring any data exchange, i.e. true parallelism.

Figure  17  shows  the  estimated  and  measured  processing 
time  for  a  matrix  of  size  1000.  The  difference  for  grid  (1,4) 
is  mainly  due  to  error  of  the  quadratic  approximation  which 
is  maximum  for  4  processors.  The  estimated  minimum  is 
6  processors;  in  practice  it  is  7  processors.  This  is  due  to 
a  load  imbalance  occurring  for  6  processors,  in  which  there 
is  a  processor  that  takes  2  seconds  more  than  the  others,  as 
shown  in  figure  18. 

The only communications involved are those that distribute the Givens rotations, estimated, assuming a convergence rate of γ, as:

$$\gamma n^2 (r - 1) \quad (27)$$

This is an estimate because the number of chases depends on the rate of convergence of the QR iteration. This rate is expected to be less than 2 [8]. The estimated values of figure 19 were obtained with γ = 0.9, determined experimentally with the matrix used. In this algorithm the communication parameters $T_E$ and $T_L$ refer to the machine that computes the Givens rotations, since it is the only emitter in the QR iteration.

Figure 17. QR iteration processing time for a matrix of size 1000

Figure 18. QR iteration load distribution for a matrix of size 1000

5.4.  Symmetric  eigenvector  computation 

In  this  subsection  the  whole  algorithm  for  eigenvec¬ 
tor  computation  executed  in  the  heterogeneous  machine  is 
compared  to  a  serial  version  [16]  when  executed  in  the 
fastest  node. 

Figure 19. Estimated (E) and measured (M) communications for QR iteration

The performance metrics used to evaluate the parallel application are, first, the runtime and, second, the speedup achieved. To have a fair comparison in terms of speedup, one defines the Equivalent Machine Number (EMN(p)), which considers the power available instead of the number of machines, since the latter is ambiguous information in a heterogeneous environment. Equation 28 defines EMN(p) for p processors used, where $S_1$ is the computational capacity of the processor that executed the serial code, also called the master processor.

$$EMN(p) = \frac{\sum_{i=1}^{p} S_i}{S_1} \quad (28)$$

For the machine presented in figure 3, EMN(6) = 3.77 and EMN(7) = 3.97, i.e. using 6 processors of the heterogeneous machine is equivalent to 3.77 processors identical to the master processor, and to 3.97 if 7 processors are used.
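
The two EMN values quoted above can be checked with a few lines of Python. The sketch assumes that EMN(p) is the total capacity of the p processors divided by the capacity of the master processor, and that the seven capacities are those of table 2 plus the 49 Mflops lower bound quoted in the conclusions, ordered fastest first; the function name is hypothetical.

```python
# Assumed processor capacities in Mflops, master (fastest) processor first.
SPEEDS = [244, 244, 161, 161, 60, 50, 49]


def emn(speeds, p):
    """Equivalent Machine Number for the first p processors (equation 28)."""
    return sum(speeds[:p]) / speeds[0]


print(round(emn(SPEEDS, 6), 2))   # 3.77
print(round(emn(SPEEDS, 7), 2))   # 3.97
```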

Figure  20  and  table  3  compare  the  virtual  machine  to  the 
fastest  node  of  the  machine  used  to  run  the  sequential  code. 
Different  grid  configurations  are  used  for  the  different  algo¬ 
rithms,  according  to  the  optimal  grid  computed  by  equation 
21. 


Figure 20. Eigenvector computation in a 7 processor heterogeneous machine compared to the sequential algorithm executed in the fastest node (horizontal axis: matrix size n, from 400 to 1600)


              Number of processors used, by matrix size (n)         Grid
Stage         400    600    800    1000   1200   1400   1600        (p × q)
TRD           1      2      3      3      4      4      4           1 × q
Q (Orth.)     5      6      6      6      7      7      7           1 × q
QR it         5      6      6      6      6      6      6           p × 1
Speedup       1.0    1.7    2.3    2.6    2.9    3.0    3.1
EMN           3.6    3.8    3.8    3.8    4.0    4.0    4.0
Efficiency    0.3    0.5    0.6    0.7    0.7    0.8    0.8

Table 3. Processors used in each stage of the eigenvector computation


6.  Conclusions 

Briefly  stated,  the  methodology  presented  in  this  paper 
was  designed  to  address  problems  arising  in  the  context  of 
using  image  processing  and  analysis  algorithms  for  inter¬ 
actively  extracting  important  data  and  information  from  im¬ 
ages  of  a  specific  application  domain,  e.g.  medical  imaging. 

Currently,  this  activity  is  often  conducted  by  exploring 
the  functionality  (hardware  and  software)  of  general  pur¬ 
pose  systems,  which  usually  trade  off  algorithm  sophistica¬ 
tion  and  user  comfort;  this  means  that  more  advanced  image 
tools  may  be  absent  in  these  systems  due  to  practical  con¬ 
siderations. 

The  main  goal  of  the  work  herein  presented  was  to  take 
advantage  of  the  existence  of  a  network  of  computers  (this  is 
a  very  frequent  situation  in  many  user  organizations)  to  try 
and  move  the  aforementioned  trade-off  in  the  direction  of 
allowing  the  provision  of  more  advanced  and  sophisticated 
algorithms  without  sacrificing  user  comfort. 

The  results  presented  show  that,  for  the  important  linear 
algebra  building  blocks  of  many  advanced  image  analysis 
methods,  the  stated  goal  may  be  accomplished;  an  improve¬ 
ment  has  been  achieved  in  the  execution  time,  by  a  factor  of 
about  3,  which  may  bring  more  image  analysis  tools  into 
the  feasible  condition  for  new  general-purpose  software. 

A  collection  of  machines  with  a  wide  range  of  process¬ 
ing  capacities,  from  244  to  49  Mflops  in  the  case  presented, 
can  cooperate  and  achieve  a  considerable  speedup  in  linear 
algebra  algorithms.  The  load  balancing  strategy  proved  to 
be  a  determinant  condition  for  the  quality  of  the  results. 

A  methodology  to  determine  in  a  computer  network  the 
number  of  active  processors  that  minimizes  the  total  pro¬ 
cessing  time  for  a  specific  parallelized  algorithm  was  pre¬ 
sented.  The  main  objective  is  that  the  user  of  a  computa¬ 
tionally  demanding  application  may  benefit  from  the  com¬ 
putational  power  distributed  over  the  network,  while  keep¬ 
ing  other  active  users  undisturbed. 

This goal can be achieved in a transparent manner for the user, once the modules of his/her application are correctly parallelized for the target network and the performance of the machines in the network is known. The application, before initiating a parallel module, determines the best available computer composition for a parallel virtual computer to execute it, and then launches the module, achieving the best response time possible in the actual network conditions.

Practical  tests  of  the  methodology  were  conducted  both 
on  homogeneous  and  heterogeneous  networks,  using  basic 
algorithms  from  linear  algebra;  in  both  cases,  the  theoreti¬ 
cal  values  computed  were  confirmed  by  the  measured  per¬ 
formance.  It  was  shown  that  a  good  load  balancing  could 
be  achieved  even  for  a  heterogeneous  environment,  by  us¬ 
ing  an  appropriate  processor  layout.  Other  generic  modules 
will  be  parallelized  and  tested,  so  that  an  ever  increasing 
number  of  image  analysis  methods  may  be  assembled  from 
them.  Application  domains  other  than  image  analysis  may 
also benefit from the proposed methodology.

Abstract

Task assignment and scheduling algorithms for heterogeneous computing systems can be classified as iterative and non-iterative techniques, and are designed to optimize a specific cost function defined on the system. The quality of the solutions generated is controlled by the nature of this cost metric. The common metrics that are used include minimizing the overall execution time or minimizing the load on the maximum loaded processor. In this work, a new set of cost metrics is proposed that can be used by iterative task assignment algorithms. These metrics exploit the fact that in iterative algorithms the mapping of the subtasks to the processors is known at every iteration. They reflect the actual scheduling cost of the application, thereby improving the quality of the solutions generated by the algorithm. The proposed metrics are evaluated using the learning automata based iterative algorithm [15]. Observations are made regarding the nature of the metrics from the results obtained.

Key Words: Task assignment and scheduling, Heterogeneous computing, Cost function.


1  Introduction 

Efficient task assignment and scheduling is critical to achieving high performance in Heterogeneous Computing (HC) systems [2, 10]. In these systems, applications are represented as a directed acyclic graph called the task flow graph (TFG), and the processing resources are represented as a directed graph called the processor graph (PG). The purpose of scheduling is to map the tasks to the available processors and order their execution, so that the task precedence requirements are satisfied and the schedule length is minimized. It has been shown that the scheduling problem in general is an NP-complete problem [14], and hence a number of heuristic algorithms have been proposed to solve it.

These algorithms can be broadly classified as iterative and non-iterative algorithms. Proposed works in the former category include [12, 13, 15, 17], and the algorithms in the latter are [1, 3, 4, 5, 6, 9, 11, 16]. The non-iterative algorithms work by exploiting the graph-theoretic properties of the TFG to generate a solution that optimizes a specific cost function. The iterative algorithms, on the other hand, proceed by generating an initial random solution and then progressively improving it, subject once again to optimizing the cost criterion. Due to the difference in approach of the two classes of algorithms, the influence of the cost metric on their ability to generate efficient solutions varies. Traditionally, however, generic cost metrics like minimization of overall execution time or minimization of the load on the maximum loaded processor have been used for both classes of algorithms. In this work, we propose a new set of cost metrics that are applicable to the iterative algorithms. The proposed metrics generate solutions that are closer to the actual schedule time.

The  material  in  this  paper  is  organized  as  follows.  Sec¬ 
tion  2  begins  with  the  required  preliminaries  and  explains 
the  system  model  within  which  the  metrics  have  been  de¬ 
fined.  The  next  section  describes  the  proposed  set  of  cost 
functions.  Section  4  evaluates  these  functions  and  makes 
observations  about  the  efficiency  based  on  the  results  ob¬ 
tained.  The  last  section  concludes  the  work. 

2  Preliminaries 

This  section  introduces  the  required  preliminaries  and 
describes  the  system  model  within  which  the  proposed  cost 
metrics  have  been  defined. 

It is assumed that the application has been partitioned into subtasks and modeled by means of a directed acyclic graph called the task flow graph. The nodes of the TFG correspond to the subtasks and the edges represent the data dependencies between them. The nodes are represented by the set S and the edges by the set $E_{TFG}$. Hence,

$$S = \{s_i,\ 0 \le i < |S|\}$$ and

$$E_{TFG} = \{(i,j) \mid s_i, s_j \in S \text{ and } s_i \text{ dependent on } s_j\}$$

Every directed edge in the graph indicates the flow of data from one subtask to another. The edges are assigned a weight that corresponds to the number of data units exchanged between the corresponding pair of subtasks. If $e^{TFG}_{ij}$ represents the edge weight, then:

$$e^{TFG}_{ij} = \begin{cases} \text{\# of data units exchanged between } s_i \text{ and } s_j, & \text{if } (i,j) \in E_{TFG}; \\ 0, & \text{otherwise.} \end{cases}$$

The processor configuration is assumed to be modeled as a directed graph called the processor graph. Here, the nodes correspond to the processors and the edges to the communication links between them. Let M represent the set of nodes and $E_{PG}$ the set of edges, then:

$$M = \{m_i,\ 0 \le i < |M|\}$$ and

$$E_{PG} = \{(i,j) \mid m_i, m_j \in M \text{ and } m_i \text{ is connected to } m_j\}$$

Here again, the edges, which indicate whether or not a pair of processors is connected, are assigned weights that correspond to the cost of communicating a single unit of data from one processor to another. Let $e^{PG}_{ij}$ represent the edge weight, then:

$$e^{PG}_{ij} = \begin{cases} \text{cost of communicating a data unit between } m_i \text{ and } m_j, & \text{if } (i,j) \in E_{PG}; \\ \infty, & \text{otherwise.} \end{cases}$$

In addition to this information, it is assumed that the costs of executing each of the subtasks on each of the processors are known. These values are stored as a matrix called $E_T$. The matrix can be represented as:

$$E_T = \{e_t(i,j),\ 0 \le i < |S|,\ 0 \le j < |M|\}$$

where $e_t(i,j)$ is the execution time of subtask $s_i$ on machine $m_j$.

Since in this work we are concerned only with the iterative algorithms, there should be a means of representing the solution generated at every iteration. In general, the solution can be conceived as a mapping, π, from the set of subtasks to the set of processors.

$$\pi : S \rightarrow M$$

Let n represent the iteration number. Then the solution generated at iteration n can be represented as $\pi_n(i)$, where $\pi_n(i)$ indicates the machine m to which subtask $s_i$ is assigned at iteration n.


The  representation  of  the  system  model  is  now  complete 
and  the  cost  functions  can  be  defined  on  the  system.  This 
forms  the  subject  of  the  next  section. 
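
The system model described in this section can be held in a few plain data structures. The following Python sketch is one possible representation, not the one used by the authors: the class name, field names and the small example instance are all hypothetical. Edge weights default to 0 for the TFG and to infinity for the PG, following the two definitions above.

```python
import math
from dataclasses import dataclass


@dataclass
class SystemModel:
    tfg_edges: dict   # (i, j) -> data units; subtask s_i depends on s_j
    pg_edges: dict    # (i, j) -> cost per data unit between m_i and m_j
    exec_time: list   # exec_time[i][j] = execution time of s_i on m_j

    def data_units(self, i, j):
        """TFG edge weight: 0 when (i, j) is not an edge."""
        return self.tfg_edges.get((i, j), 0)

    def link_cost(self, i, j):
        """PG edge weight: infinity when (i, j) is not an edge."""
        return self.pg_edges.get((i, j), math.inf)

    def predecessors(self, i):
        """Subtasks that s_i depends on."""
        return sorted(j for (a, j) in self.tfg_edges if a == i)


# Hypothetical instance: three subtasks, two machines.
model = SystemModel(
    tfg_edges={(1, 0): 10, (2, 0): 4, (2, 1): 6},
    pg_edges={(0, 1): 0.5, (1, 0): 0.5},
    exec_time=[[3.0, 5.0], [2.0, 1.0], [4.0, 4.0]],
)
mapping = [0, 1, 0]   # the mapping pi_n as a list: subtask index -> machine index
print(model.predecessors(2), model.exec_time[1][mapping[1]])
```

Storing the mapping as a list indexed by subtask makes the lookup of the execution time of a subtask on its assigned machine a single indexing operation.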

3  Proposed  Cost  Metrics 

The  objective  of  iterative  task  assignment  algorithms  is 
to  explore  the  solution  space  efficiently  in  order  to  seek  the 
global optimal solution. The solution space is characterized by the cost function defined on the system, and hence the cost function becomes critical to the performance of the algorithm. In these
assignment  algorithms,  the  process  of  determining  the  final 
solution  proceeds  by  initially  generating  a  random  solution. 
This  is  then  evaluated  for  its  merit,  based  on  which  a  new 
improved  solution  is  generated  in  the  next  iteration.  The 
process  is  repeated  until  the  solution  converges,  or  in  other 
words  when  there  is  no  further  improvement  in  the  quality 
of  the  solution.  Hence,  at  each  iteration  of  the  algorithm, 
the  mapping  of  the  subtasks  to  the  machines  is  known.  In 
previous  works,  this  information  is  neglected  when  trying 
to  determine  the  solution  to  the  assignment  problem.  But  it 
can  be  utilized  to  explore  the  solution  space  more  efficiently 
and  bring  the  final  solution  closer  to  the  actual  scheduling 
cost.  In  this  work,  we  propose  precisely  such  a  set  of  met¬ 
rics. 

To begin with, let us define terminology that will help develop the cost metrics. For each node $s_i \in S$ in the TFG we associate three schedule times, for each of the machines $m_j \in M$ a machine start time, and arrays of nodes. These are defined as:

$MST[j, l]$ → the machine start time for machine $m_j$ at the beginning of level l.

$EST[i]$ → the earliest time at which the subtask $s_i$ can begin its execution.

$WT[i]$ → the amount of time the subtask $s_i$ has to wait before it can begin its execution.

$CT[i]$ → the time at which the subtask $s_i$ completes its execution.

$pred_i[\,]$ → an array consisting of the predecessor nodes of subtask $s_i$.

$order_{j,l}[\,]$ → an array that specifies the order in which the subtasks in level l and assigned to machine $m_j$ are executed.

$\Pi_{j,l}$ → represents the number of subtasks in $order_{j,l}[\,]$.

It can be readily inferred that for any node $s_i$ in the TFG, the completion time $CT[i]$ at iteration n can be computed as:

$$CT[i] = EST[i] + WT[i] + e_t(i, \pi_n(i))$$

In order to compute MST, EST and WT of the nodes, however, information about the structure of the TFG is needed. This is obtained by leveling the TFG. The leveling process is similar to the levelized min-time heuristic proposed in [7]. The root node or nodes are first assigned 'level 0'. Their successor nodes, if all of their predecessors are in 'level 0', are assigned 'level 1'. The next successor nodes, if all of their predecessors are either in 'level 0' or 'level 1', are assigned 'level 2', and so on. Finally, the leaf nodes are assigned a level number equal to the height of the TFG. Now the machine start time, earliest start time and wait times of the nodes can be computed.
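The leveling step described above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the TFG is given as a list of (predecessor, successor) index pairs, and the names `level_tfg` and `leaves_at_height` are hypothetical, not taken from the paper or from [7].

```python
from collections import deque


def level_tfg(num_nodes, edges, leaves_at_height=True):
    """Assign a level to every node of an acyclic task flow graph.

    edges is a list of (pred, succ) pairs over node indices 0..num_nodes-1.
    A node's level is one more than the highest level among its predecessors;
    roots get level 0.
    """
    preds = [[] for _ in range(num_nodes)]
    succs = [[] for _ in range(num_nodes)]
    for u, v in edges:
        preds[v].append(u)
        succs[u].append(v)

    level = [0] * num_nodes
    remaining = [len(p) for p in preds]          # unprocessed predecessors
    ready = deque(i for i in range(num_nodes) if remaining[i] == 0)
    visited = 0
    while ready:
        u = ready.popleft()
        visited += 1
        for v in succs[u]:
            level[v] = max(level[v], level[u] + 1)
            remaining[v] -= 1
            if remaining[v] == 0:
                ready.append(v)
    if visited != num_nodes:
        raise ValueError("graph has a cycle; a TFG must be acyclic")

    if leaves_at_height:
        # Text: leaf nodes are assigned a level equal to the height of the TFG.
        # Assumption: isolated roots (no predecessors) stay at level 0.
        height = max(level, default=0)
        for i in range(num_nodes):
            if not succs[i] and preds[i]:
                level[i] = height
    return level


if __name__ == "__main__":
    print(level_tfg(5, [(0, 1), (0, 2), (1, 3), (2, 3), (0, 4)]))
```

Grouping node indices by the returned level then gives the sets of subtasks that may be reordered on a machine within one level.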


The machine start time, MST, refers to the time at which a particular machine can begin executing tasks from a particular level. This is important, since nodes or tasks at a later level have to wait until all the tasks in previous levels have completed execution on the machines they were assigned to. Therefore we have:


$$MST[j,l] = \begin{cases} 0, & \text{if } l = 0; \\ \max_{k} \{ CT[order_{j,l-1}[k]] \}, & \text{otherwise.} \end{cases}$$


Assume that a subtask $s_i$ has $p$ predecessor nodes, represented as defined previously by the array $pred_i[k]$ with $0 \le k \le p-1$. Then,

$$EST[i] = \begin{cases} 0, & \text{if } p = 0; \\ \max_{k=0}^{p-1} X(k), & \text{otherwise,} \end{cases}$$

where $X(k) = \max\{\, CT[pred_i[k]] + \ldots \,\}$

To complete the calculation of the completion time for each of the subtasks, the wait time has to be defined. The wait time determines when a task will begin its execution and hence determines the efficiency of the cost metrics. If a subtask is the only task that has been mapped to a particular processor, then it does not have to wait to begin its execution. Its wait time will therefore be equal to zero. But if more than one subtask is mapped to the same machine, then it is possible to order their execution so that a better solution can be obtained. Since the TFG has task precedence constraints, only the subtasks in the same level can be considered for this ordering. The ordering of the subtasks determines the wait time for each of them. Here, three different orderings of the subtasks that result in three cost metrics, named CM.1, CM.2 and CM.3, are proposed.


Cost Metric CM.1:

The subtasks from the same level that are assigned to a particular machine are executed in non-decreasing order of their earliest start times. For instance, if we have three subtasks $s_1$, $s_2$ and $s_3$, and assume that $EST[1] < EST[2] < EST[3]$, then the subtask $s_1$ will be executed first, followed by $s_2$ and then $s_3$.

Cost Metric CM.2:

In the second cost metric, the subtasks are executed in non-decreasing order of their expected execution times. For the aforementioned three tasks, if we assume that $e_t(3,j) < e_t(2,j) < e_t(1,j)$, where $m_j$ is the machine to which they are assigned, then the subtask $s_3$ is executed first, followed by $s_2$ and $s_1$.

Cost Metric CM.3:

The last cost metric proposed here in a way combines the ideas of the previous two cost metrics. It orders the subtasks in non-decreasing order of the sum of the earliest start time and the expected execution time of the subtasks.


The difference between these cost metrics can be clearly understood by means of an illustrative example. Assume that three subtasks $s_1$, $s_2$ and $s_3$ belong to the same level in a TFG, and are assigned to the machine $m_j$. Let their earliest start times and expected execution times be:

$EST[1] = 4$, and $e_t(1,j) = 11$ time units.

$EST[2] = 7$, and $e_t(2,j) = 6$ time units.

$EST[3] = 18$, and $e_t(3,j) = 3$ time units.

Figure 1 shows the schedule times for the three subtasks based on the ordering of CM.1. Here, the maximum of the completion times amongst the subtasks is 24 time units. The schedule times corresponding to the order of CM.2 are shown in Figure 2. For this metric the maximum of the completion times is 38 time units. The final metric, CM.3, results in the schedule times shown in Figure 3. Here, the maximum of the completion times is 27 time units. Therefore, for the example case, the ordering of CM.1 offers the best solution, as it results in the minimum schedule time (24) amongst the three metrics.
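The three completion times quoted in this example can be reproduced with a short Python sketch. It assumes that the first subtask in an ordering starts at its earliest start time and that each later subtask waits only if its predecessor on the machine is still running; the helper names `schedule`, `METRICS` and `makespans` are ours, introduced purely for illustration.

```python
def schedule(tasks, key):
    """Order the subtasks on one machine and compute completion times.

    tasks maps a subtask name to (EST, expected execution time).
    key(est, exec_time) returns the value the metric sorts on.
    """
    order = sorted(tasks, key=lambda name: key(*tasks[name]))
    completion = {}
    prev_ct = None
    for name in order:
        est, exec_time = tasks[name]
        # First subtask does not wait; later ones wait for the previous one.
        wait = 0 if prev_ct is None else max(0, prev_ct - est)
        completion[name] = est + wait + exec_time
        prev_ct = completion[name]
    return order, completion


# Values from the illustrative example in the text.
TASKS = {"s1": (4, 11), "s2": (7, 6), "s3": (18, 3)}

METRICS = {
    "CM.1": lambda est, exec_time: est,              # earliest start time
    "CM.2": lambda est, exec_time: exec_time,        # expected execution time
    "CM.3": lambda est, exec_time: est + exec_time,  # sum of the two
}


def makespans(tasks=TASKS):
    """Maximum completion time on the machine under each metric."""
    return {name: max(schedule(tasks, key)[1].values())
            for name, key in METRICS.items()}


if __name__ == "__main__":
    for name, key in METRICS.items():
        order, completion = schedule(TASKS, key)
        print(name, order, completion)
```

Under these assumptions CM.3 orders the subtasks as s2, s1, s3, because their sums of earliest start time and execution time are 13, 15 and 21 respectively.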


The different orderings of the subtasks represented by the metrics can be incorporated into the generic cost metric by means of the wait time. The array $order_{j,l}[\,]$ specifies the ordering of the subtasks in level $l$ that are assigned to machine $m_j$, as shown in the initial definitions. Now the wait time can be generically defined as:


$$WT[order_{j,l}[k]] = \begin{cases} 0, & \text{if } k = 0; \\ \big( \mathrm{Mod}(CT[k-1] - EST[k]) + (CT[k-1] - EST[k]) \big) / 2, & \text{otherwise,} \end{cases}$$

where $0 \le k < \pi_{j,l}$.
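Assuming that Mod denotes the absolute value, and writing $x = CT[k-1] - EST[k]$ as a shorthand of our own, the second case of the wait time can be read as a clamp at zero. The short derivation below is only a sketch of that reading:

```latex
\frac{|x| + x}{2} =
\begin{cases}
  \dfrac{x + x}{2} = x, & x \ge 0,\\[4pt]
  \dfrac{-x + x}{2} = 0, & x < 0,
\end{cases}
\qquad\text{so}\qquad
WT[order_{j,l}[k]] = \max\bigl(0,\; CT[k-1] - EST[k]\bigr).
```

Under this reading, a subtask waits only when the preceding subtask on the same machine finishes after its own earliest start time. With the example values given earlier and the CM.1 ordering, the first subtask completes at 4 + 11 = 15, so the second one, whose earliest start time is 7, waits 15 - 7 = 8 time units.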




It can be observed that the proposed metrics are very similar to scheduling heuristics proposed previously in the literature, and can be easily confused with them. But there are important distinctions between the cost metrics proposed in this work and these heuristics. A scheduling heuristic works on a fixed assignment and tries to generate an optimal schedule of the mapped tasks so that the overall completion time of the application is minimized. In our approach, on the other hand, the cost metric is used as a measure of efficiency for the solutions generated. This is then used by the iterative algorithm to move towards a better solution for the assignment problem. There is also a clear distinction between the dynamic scheduling algorithms proposed in [8] and the metrics presented in this work. The dynamic scheduling algorithms begin with an initial assignment and attempt to determine the optimal remapping for each of the subtasks in turn by using the information on subtasks that have already completed execution and those that need to be executed. The cost metrics CM.1, CM.2 and CM.3, at every iteration, consider the entire mapping of subtasks to the various machines and compute a cost that is as close as possible to the actual scheduling cost. This metric is then exploited by the algorithm to explore the solution space more efficiently. Hence, although the construction of the proposed metrics is similar to that of scheduling heuristics, there are important distinctions between them.

At every iteration of any iterative algorithm, therefore, the completion times of each of the subtasks can be computed using one of the three proposed cost metrics. The objective of the algorithm would then be to minimize the completion time of the subtask that has the maximum completion time. The next section presents the results that were obtained using these cost metrics.

4  Results  and  Observations 

To evaluate the proposed cost metrics, the learning automata based iterative algorithm [15] is used, though the metrics can be used by any iterative assignment algorithm. The primary reason for using this algorithm is that it can be adapted for any user-specified cost metric without requiring a change in the construction of the algorithm. A short description of the algorithm is presented first.

The task assignment algorithm proposed in [15] works on a framework consisting of an HC system model and a learning automata model. The system model abstracts the application as a TFG and the suite of machines as a PG, similar to the model presented in this work. The algorithm can be adapted to work for any cost metric that can be defined on the system by the user. This feature is realized by means of the learning automata model. It is constructed by associating every task in the TFG with a variable structure stochastic automaton. The HC system model serves as the external environment for these automata. Six heuristics were investigated to construct the learning algorithm. The best of these heuristics is used in this work.

The simulation environment consists of the task flow graph and the processor graph, which are generated at random. The edge weights of the graphs are also generated at random with equal probability over some predefined ranges. The values for these ranges are presented in Table 1. The iterations of the algorithm are continued until the probabilities of the actions of the automata reach 0.99, or until the number of iterations reaches a user-specified limit. In all the experiments that were conducted, the algorithm terminated for the former reason, indicating that the solutions were convergent.

For the first set of experiments, the communication complexity was assumed to be low. In other words, the number of edges in the TFG is equal to one-third the number of tasks. The number of processors was varied among 2, 5, 10 and 20. The results obtained are presented in Figures 4 - 7. It can be observed from the graphs that the costs generated by the metrics differ in their optimality. They can therefore be used to deliver better solutions to the assignment problem.

The second set of experiments was conducted with a medium communication complexity, where the number of edges was equal to two-thirds that of the tasks. These results are presented in Figures 8 - 11. An observation similar to that for the first set of experiments can be made here, demonstrating once again the utility of the metrics proposed. From both sets of experiments, it can be seen that an increase in the communication complexity leads to an increased difference between the solutions generated by the metrics. The reason for this is that when the communication complexity increases, the number of tasks being assigned to the same level also increases. Hence the different orderings of the proposed metrics have a greater impact on the solutions generated.

Along the same lines, in both sets of experiments, when the number of tasks is low, the costs generated are equal between the different metrics. This is due to the fact that very few of the tasks are at the same level when the total number of tasks is low, and hence the ordering of the subtasks does not have a big impact on the solutions generated.

In general it can be seen that the proposed metrics are affected by the number of subtasks in the TFG, their data dependencies represented as the communication complexity, and the number of processors available for scheduling. Since all these factors directly affect the actual scheduling cost of an application task executing in the HC system, the proposed metrics can help in providing better solutions. Since in the results shown the graphs, the expected execution time and the communication time are all generated at random, the difference in efficiency of these metrics is not discernible.

    Number of tasks                  10, 25, 50, 75 and 100
    Number of machines               2, 5, 10 and 20
    Number of edges                  |S|/3 and 2|S|/3
    Expected execution time range    1000
    Communication time range         4
    TFG edge weight range            500

Table 1. Parameters for generating TFGs and PGs

5  Conclusions 

A new set of cost metrics for iterative task assignment algorithms in HC systems was proposed. The metrics were developed by exploiting the fact that in iterative algorithms the mapping of the subtasks to the processors is known at all iterations. The proposed functions were evaluated using the learning automata based iterative algorithm [15]. The results obtained show that when the number of tasks in the TFG is low or when the communication complexity is low, there is not much difference in the costs generated by the metrics. This difference increased when the communication complexity was increased. Since the performance of the proposed metrics depends on the factors that affect the scheduling cost, they reflect the actual scheduling time of the application. Therefore they can be used to improve the quality of solutions for the task assignment problem in heterogeneous computing systems.
Abstract

In  a  serverless  cluster  of  PCs  or  workstations,  the 
cluster  must  allow  remote  file  accesses  or  parallel  I/O 
directly  performed  over  disks  distributed  to  all  client 
nodes.  We  introduce  a  new  distributed  disk  array,  called 
the  RAID-x,  for  use  in  serverless  clusters.  The  RAID-x 
architecture  is  based  on  an  orthogonal  striping  and 
mirroring  (OSM)  scheme,  which  exploits  full-bandwidth 
and  protects  the  system  from  all  single  disk  failures. 

The performance of the RAID-x is shown experimentally to be superior to that of RAID-1 and NFS in the Linux cluster environment. We propose a new striped checkpointing scheme, leveraging striped parallelism and pipelined writing of successive disk stripes. This RAID-x architecture greatly enhances the throughput, reliability, and availability of scalable clusters. It appeals especially to I/O-centric cluster applications.

Keywords:  Scalable  computing,  RAID  architectures, 
parallel  I/O,  Linux  clusters,  disk  mirroring,  single 
system  image,  checkpointing,  staggered  writing,  and 
fault  tolerance 


1.  Introduction 

Many  redundant  arrays  of  inexpensive  disks  (RAID)  [6] 
use  independent  disks  under  the  control  of  a  single  or 
multiple  controllers.  The  TickerTAIP  [3]  pioneered  the 
Parallel  RAID  architecture  for  supporting  parallel  disk  I/O 
with  multiple  controllers.  Still,  these  parallel  disk  arrays 
are  implemented  as  a  centralized  I/O  subsystem.  These 
RAID  subsystems  are  often  attached  to  a  storage  server  or 
used as network-attached disks [10].


For  this  reason,  we  consider  the  classic  disk  arrays  as  a 
centralized  RAID.  In  contrast,  this  paper  deals  only  with 
distributed  RAID  architectures.  This  concept  was 
investigated  by  Stonebraker  and  Schloss  [25].  The  actual 
prototyping  of  distributed  RAIDs  did  not  start  until  the 
Petal  [17]  and  the  Tertiary  Disk  project  [26]. 

A  distributed  RAID  is  constructed  out  of  dispersed 
disks,  which  are  physically  attached  to  different  computer 
hosts  through  the  network  connections.  The  Petal  was  built 
with  a  chained  declustering  [12].  The  Tertiary  Disk  was 
built  with  a  RAID-5  architecture  using  software  support  by 
the  serverless  xFS  file  system  [2]. 

The architecture and performance of a new distributed RAID architecture, namely the RAID-x, are reported here. The level x is yet to be ratified with an appropriate code assignment by the RAID Advisory Board [22]. Our RAID-x differs from existing distributed RAID architectures in many aspects.

First,  the  RAID-x  is  built  with  a  new  disk  mirroring 
technique,  called  orthogonal  striping  and  mirroring 
(OSM).  The  small  write  problem  associated  with  RAID-5 
is  completely  eliminated  in  this  OSM  approach.  Second, 
we  use  cooperative  disks  instead  of  independent  disks. 

To enable true cooperation among dispersed disks, we have developed cooperative disk drivers (CDD) at the kernel level. Data consistency is maintained inside the CDD, instead of using a central network file system. Therefore, an unmodified file system interface is available to users. Third, the RAID-x was specially designed over distributed disks for I/O-centric cluster computing.

The  rest  of  the  paper  is  organized  as  follows:  Section  2 
describes  the  Trojans  cluster  architecture  and  also  presents 
an  overview  of  distributed  RAID  architectures.  Our 
RAID-x  approach  is  compared  with  the  architectural 
designs  in  Berkeley  Tertiary  Disks  running  the  xFS, 
Digital Petal system and Princeton TickerTAIP parallel
RAID  system.  Section  3  introduces  the  OSM  scheme  and 
the  RAID-x  architecture.  We  also  compare  RAID-x  with 
RAID-1  for  designing  distributed  disk  arrays.  Section  4 
describes  the  architecture  of  the  cooperative  disk  drivers 
and  data  consistency  checking  mechanisms. 

Section 5 presents the benchmark performance results obtained on the Trojans cluster. Section 6 explains the striped staggering checkpointing scheme we developed on top of RAID-x. Section 7 gives the preliminary experimental results on striped checkpointing overhead and an analysis of the reliability of the proposed checkpointing scheme. Section 8 summarizes the contributions and identifies extended research work.


2.  USC  Trojans  Cluster  Architecture 

The  prototype  Trojans  cluster  was  built  with  16 
Pentium  PCs  (Pentium  II  400MHz)  running  the  Linux 
operating  system  (Redhat  Linux  6.0  with  kernel  2.2.5). 
These  PC  nodes  are  connected  by  a  100  Mbps  Fast 
Ethernet  switch. 

At present, each node has a 10-GB disk attached. With 16 nodes, the total capacity of the disk array is 160 GB. All 16 disks form a single I/O space. Figure 1a shows the front view of the prototype Trojans cluster. This cluster is connected to the Internet over fiber links.

As illustrated in Fig. 1b, we subdivide the cluster nodes into three functional classes. The entry partition is for the user to access the cluster through the Internet/Intranet. Nodes in the service partition provide the services requested by users. The database partition supports database or information access operations. Nodes in the three partitions can be dynamically reconfigured to suit special application demands.

To build a distributed RAID with a single I/O space (SIOS), our research objectives are identified in three aspects: (i) A single address space for all data blocks in the cluster. This means that the users can utilize all disk storage in a cluster without knowing the physical locations of the data blocks referenced or of the files used. (ii) High scalability, availability, and compatibility with current cluster architectures and applications must be maintained. (iii) Remote disk I/O operations should have performance at least comparable to that of local disk I/O operations.

Previous approaches to achieving SIOS were attempted at the user level, file-system level, and device-driver level. The user-level approach has the lowest cost and higher portability across different platforms. The Parallel Virtual File System (PVFS) [18] and the Remote I/O project [9] are two examples. However, this approach does introduce two problems: First, users still have to use specific APIs and identifiers to exploit the full functionality of the packages. Second, using system calls to perform network and file I/O is too expensive to meet real-time or cluster computing requirements.


(a)  A  front-view  of  the  Trojans  Cluster 


(b)  I/O-centric  cluster  architecture 

Fig.  1.  Trojans  cluster  built  at  USC  Internet  and 
Cluster  Computing  Laboratory 
Distributed file systems provide another approach to offering SIOS to the users. Users can access remote data as if it were local. The serverless xFS system developed at Berkeley [2] and the Solaris MC project are good examples. However, this approach has its own shortcomings.

Changing  the  file  system  does  not  guarantee  high 
compatibility  with  current  applications.  This  will 
discourage  the  deployment  of  the  distributed  file  systems 
in  clusters.  What  we  want  to  achieve  is  a  SIOS  with  an 
unmodified  file  system  to  achieve  high  portability  with  a 
low  cost/performance  ratio. 

Device-driver level designs provide SIOS not only to the users, but also to the file system. We choose this approach, because it solves most of the above problems and shortcomings. The Digital Petal project [17] uses a user-level device driver design to enable remote I/O access.

All  physically  distributed  disks  can  be  viewed  as  a 
collection  of  virtual  disks.  Each  virtual  disk  can  be 
accessed  as  if  it  is  a  local  disk.  Petal  developed  a 
distributed  file  system,  called  Frangipani  [27].  In  Petal,  the 
actual  data  transfer  is  handled  at  the  user  level. 

We have developed Cooperative Device Drivers (CDD). These drivers work cooperatively at the kernel level. Data consistency is maintained by the CDD. An unmodified file system is used to achieve high portability and compatibility.

The development of the RAID-x architecture was inspired by previous projects. The pioneering RAID work at Berkeley [2] [6] [8] and at CMU [10], the TickerTAIP project [3], the Tertiary Disk project [25], chained declustering [12], and the Petal project [17] all have influenced our design philosophy.

Our RAID-x design appeals especially to serverless clusters. The major innovation in our design lies in the cooperation of distributed disks in a serverless cluster environment. The cooperation is established at the Linux kernel level, rather than in user space.

Petal and Tertiary Disk achieve the SIOS at the levels of user-level device drivers and the xFS file system, respectively. The Digital Petal virtual disks were built in 1996, the Berkeley Tertiary Disk project was reported in 1998, the Princeton TickerTAIP parallel RAID was designed in 1993, and our RAID-x was built in the USC Trojans project in 1999.

The  entries  in  Table  1  distinguish  the  four  parallel  and 
distributed  RAID  architectures  in  four  aspects.  All  four 
I/O  subsystems  support  SIOS,  however  by  quite  different 
mechanisms.  All  four  parallel  and  distributed  RAIDs 
support  parallel  disk  I/O  at  the  block  level. 

The first distinction among the four distributed RAIDs lies in their architectures. The Petal virtual disk array uses chained declustering, Tertiary Disk applies RAID-5, TickerTAIP uses parallel disk array controllers within a single RAID server to implement parallel RAID-5, and we use the new RAID-x architecture.

Our major contributions lie in the creation of the OSM and CDD mechanisms. The enabling mechanisms for SIOS are also quite different among the four architectures. TickerTAIP achieves SIOS by event-driven simulation among all the worker nodes. We realize the SIOS with cooperative device drivers at the Linux kernel level.


Table 1 Parallel and Distributed RAID Projects at USC, Princeton, Digital and Berkeley

RAID architecture and environment
    USC Trojans RAID-x:            Orthogonal striping and mirroring over the RAID-x in a Linux cluster
    Princeton TickerTAIP [3]:      RAID-5 with multiple controllers in a single server
    Digital Petal [17]:            Chained declustering in a Unix cluster
    Berkeley Tertiary Disk [26]:   RAID-5 built with a Solaris PC cluster

Enabling mechanism for SIOS
    USC Trojans RAID-x:            Cooperative device drivers in Linux kernels
    Princeton TickerTAIP [3]:      Single server implements the SIOS directly
    Digital Petal [17]:            Petal device drivers at user level
    Berkeley Tertiary Disk [26]:   xFS storage servers at file system level

Data consistency checking
    USC Trojans RAID-x:            Locks at device driver level
    Princeton TickerTAIP [3]:      Sequencing of user requests
    Digital Petal [17]:            Supported by Frangipani file system
    Berkeley Tertiary Disk [26]:   Locks in the xFS file system

Communication mechanism
    USC Trojans RAID-x:            TCP/IP sockets
    Princeton TickerTAIP [3]:      Not available
    Digital Petal [17]:            UDP/IP sockets
    Berkeley Tertiary Disk [26]:   RPC at user level
Even though both Petal and RAID-x choose the device driver approach, their implementations are very different, at the UNIX user level and the Linux kernel level respectively. Petal does provide a global name space for logical disks in the cluster. We want to extend the global name space to each data block in the cluster.

The four RAID architectures differ in their handling of the data consistency problem in establishing a distributed file management system. We implemented the lock mechanisms within the device drivers. Our performance results are generated in a Linux cluster environment.

For  inter-node  communications,  we  use  the  TCP/IP 
sockets.  Regardless  of  their  differences,  we  believe  that 
hardware  and  software  experiences  learned  from 
distributed  RAID  projects  will  be  complementary  to  each 
other  in  many  aspects. 

For parallel writes, the RAID-x has lower access times than RAID-1. These claims are based on benchmark results to be presented in Section 5. To sum up, the RAID-x scheme demonstrates scalable I/O bandwidth with much reduced latency in a cluster environment.

Using the CDDs, a cluster can be built serverless and can offer remote disk access directly at the kernel level. Parallel I/O is made possible on any subset of local disks, because all distributed disks form a SIOS. No heavy cross-space system calls are needed to perform remote file accesses.


3.  Orthogonal  Striping  and  Mirroring 

Over the years, many techniques have been developed to overcome the small-write problem [6] [22], such as parity logging [24], floating parity and data [20], parity striping [7], disk caching disk [13], log-structured disk subsystem [19] and chained declustering [12]. The concept of OSM started with our earlier work [16].

In  this  paper,  we  present  the  design  details  of  RAID-x 
and  prove  its  effectiveness  through  experimentation. 
Figure  2  shows  the  architecture  of  RAID-x  (Fig.2b)  along 
with  RAID-1  (Fig.2a)  architectures.  The  original  data 
blocks  are  denoted  as  Di  in  the  unshaded  boxes.  The 
corresponding  image  blocks  are  distinguished  with  primes, 
such  as  Di’  in  the  shaded  boxes.  The  RAID-x  completely 
avoids  the  small  write  problem. 

As shown in Fig. 2b, data blocks in RAID-x are striped across the disks on the top half of the disk array. The low latency and high bandwidth of RAID-0 are preserved in the RAID-x architecture. The image blocks of other data blocks in the same stripe are clustered vertically in the same disk. All image blocks occupy the lower half of the disk array. On a RAID-x, the images are copied and updated in the background, thus saving the overhead time.

Consider the top stripe of data blocks D0, D1, D2, and D3 in Fig. 2b. Their image blocks D0', D1', and D2' are stored in Disk 3, while the image block D3' is stored in Disk 2. The rule is that no data block and its image should be mapped to the same disk. Full bandwidth is achievable in parallel disk I/O across the same stripe.
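The placement of the top stripe and the rule that a block and its image never share a disk can be checked with a small Python sketch. The function `osm_stripe_layout` and its rotation of the image disk across later stripes are our own hypothetical choices for illustration; they reproduce the top-stripe placement described above but are not the mapping functions of Table 2.

```python
def osm_stripe_layout(stripe, n):
    """Return {logical_block: (data_disk, image_disk)} for one stripe.

    Data blocks are striped RAID-0 style across n disks (n >= 2).
    ASSUMPTION: the disk that clusters the images rotates with the stripe
    number; only the stripe-0 placement is taken from the text.
    """
    if n < 2:
        raise ValueError("mirroring needs at least two disks")
    image_disk = (n - 1 - stripe) % n   # disk holding the clustered images
    spill_disk = (image_disk - 1) % n   # image of the block stored on image_disk
    layout = {}
    for data_disk in range(n):
        block = stripe * n + data_disk
        target = spill_disk if data_disk == image_disk else image_disk
        layout[block] = (data_disk, target)
    return layout


if __name__ == "__main__":
    # Top stripe on four disks: images of D0, D1, D2 on Disk 3; D3' on Disk 2.
    print(osm_stripe_layout(0, 4))
```

With four disks, the sketch places n - 1 = 3 images of each stripe on a single disk, which is what allows them to be written there as one long block.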

For a large write, the data blocks are written in parallel to all disks simultaneously. The image blocks are gathered as a long block written into the same disk with a reduced latency. In the case of a small write of a single block, the write is directed to the data block, while the writing of the image block is postponed until all the clustered image blocks are ready.

(a) Duplicated striping in RAID-1

(b) Orthogonal striping and mirroring in RAID-x

Fig. 2 The mirroring schemes in RAID-1 and RAID-x

We define a pair of functions for the logical data block to physical RAID mapping: data-mapping-function and mirror-mapping-function. The data-mapping-function is a one-to-one function which maps a logical RAID block address A to a physical disk address (DiskNo, StripeNo). The mirror-mapping-function maps the corresponding image block of a logical block address to a physical disk address. A, DiskNo, and StripeNo count from 0.

We define n as the number of disks in the array, k as the number of blocks per disk, and A as the logical RAID block address. Table 2 gives the data-mapping function and mirror-mapping function for RAID-1 and RAID-x. The notation mod stands for the arithmetic modulo operation. Table 2 also lists the expected peak performance of the two RAID architectures.

The  maximum  bandwidth  of  a  disk  array  reflects  the 
ideal  case  of  parallel  accesses  of  all  useful  data  blocks.  B 
stands  for  the  bandwidth  per  disk.  In  the  best  case,  a  full 
bandwidth  of  nB  can  be  delivered  by  RAID-x.  The  RAID- 
1  can  only  deliver  half  of  the  full  bandwidth.  The  parallel 
read  or  parallel  write  time  of  a  file  of  m  blocks  depends  on 
the  read  or  write  latencies  (R  and  W)  per  block,  the  array 
size  n,  and  the  file  size  m. 

The entries in Table 2 are the expected peak
performance of parallel disk I/O operations, excluding all
software overhead and network delays. For a large read,
RAID-x is expected to take mR/n time, since each disk
performs m/n reads simultaneously, while RAID-1 takes
twice as long. For a small read of a single block, both
require R time.
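The read-side estimates above can be evaluated numerically with a short sketch. The function names and the sample latency are illustrative assumptions; only the formulas mR/n, 2mR/n, nB, and nB/2 come from Table 2.

```python
def large_read_time(m, n, R, scheme):
    """Expected peak time to read an m-block file from an n-disk array.

    R is the per-block read latency; software and network overheads are excluded.
    """
    if scheme == "RAID-x":
        return m * R / n        # all n disks deliver useful blocks in parallel
    if scheme == "RAID-1":
        return 2 * m * R / n    # RAID-1 doubles the RAID-x latency
    raise ValueError(f"unknown scheme: {scheme}")


def max_bandwidth(n, B, scheme):
    """Peak aggregate bandwidth, with B the bandwidth per disk."""
    if scheme == "RAID-x":
        return n * B
    if scheme == "RAID-1":
        return n * B / 2
    raise ValueError(f"unknown scheme: {scheme}")


if __name__ == "__main__":
    m, n, R = 2560, 16, 1.0     # 2560 blocks on 16 disks; R = 1.0 is a made-up unit latency
    for scheme in ("RAID-1", "RAID-x"):
        print(scheme, large_read_time(m, n, R, scheme), max_bandwidth(n, 1.0, scheme))
```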

For parallel writes in RAID-x, the image blocks
clustered in one disk are written to that disk at the same
time. That is, m/n(n-1) image blocks are written together
to each disk. Therefore, the large write latency is reduced
to mW/n + m/n(n-1).

For small writes, RAID-x takes only W time to
write the data block. The image blocks are written later,
once all the stripe images are clustered at the same disk.
This clustered write can be done in the background,
overlapping with the regular data writes.

Table 2 also shows the maximum number of disk
failures that each disk array can tolerate. RAID-x can
tolerate a single disk failure; in this respect RAID-1 is
more robust than RAID-x. The experimental results in
section 6 will verify the accuracy of the expected
performance.

Figure  3  illustrates  an  example  of  the  two-dimensional 
RAID-x  architecture  with  3  disks  attached  to  each  node. 
The  maximum  number  of  disks  attached  to  each  SCSI 
controller  is  determined  by  the  SCSI  controller  used.  For 
Wide/Fast  SCSI-II,  15  disks  can  be  connected  to  one 
single  SCSI  controller. 

To implement the SIOS, the addresses of all the data
blocks are linearly continuous across all the member
disks. The disks that occupy the same position on each
node form one stripe group. All the disks within a
stripe group can be accessed in parallel.

Different stripe groups are independent. Because all the
disks within one node are connected through a SCSI bus,
different stripe groups can be accessed in a pipelined
fashion. The degree of overlap between stripe groups
depends on the properties of the SCSI bus used.

The Trojans cluster is presently being upgraded to 4
disks per node. Using 20 GB SCSI disks, the next RAID-x
array will have 1.28 TB on 64 disks. In the future, the
Trojans cluster will scale to hundreds of PC nodes or
more, using the next generation of microprocessors and
Gigabit switched connections.

Using  the  Fast  Ethernet,  the  aggregate  I/O  bandwidth  is 
at  most  12.5  MB/s.  As  reported  in  section  5,  we  have 
achieved  9.7  MB/s  bandwidth  for  large  parallel  reads.  This 
represents  78%  efficiency  in  the  cluster  utilization. 


Table 2  Architectural Characteristics of RAID-1 and RAID-x

Performance indicator                   RAID-1              RAID-x
--------------------------------------  ------------------  -------------------------------------
Data block mapping        DiskNo        A mod n/2           A mod n
                          StripeNo      (2A/n) mod k        (2A/n) mod k
Mirror-mapping function   DiskNo        n/2 + A mod n/2     (-(A/(n-1)) mod k/2 - 1) mod n
                          StripeNo      (2A/n) mod k        k/2 + (A/(n-1)n)(n-1) + A mod (n-1)
Max. bandwidth            Read/Write    nB/2                nB
Estimates of parallel     Large read    2mR/n               mR/n
read/write time           Small read    R                   R
                          Large write   2mW/n               mW/n + m/n(n-1)
                          Small write   W                   = W
Max. fault coverage                     n/2 disk failures   Single disk failure

Figure 3. Distributed RAID-x architecture, shown with a 4 x 3 configuration

(P:  processor,  M:  memory,  CDD:  cooperative  disk  drivers.  All  shaded  blocks 
are  mirrored  images  of  the  corresponding  unshaded  data  blocks) 


With a 128-node cluster and 8 disks per node, the disk
array could be enlarged to a total capacity exceeding
20 TB, suitable for large-scale database or
multimedia applications. With an enlarged array of 128
disks, the cluster must be upgraded to a Gigabit switched
connection. Given the growing I/O bandwidth, the
Trojans cluster and its RAID-x architecture show a very
promising future in terms of scalability and availability.


4.  Cooperative  Disk  Drivers 

The single I/O space (SIOS) is crucial to building a
scalable cluster of computers. A loosely coupled cluster
uses distributed disks driven by different hosts
independently. The independent disk drivers handle
distinct I/O address spaces. Without the SIOS, remote disk
I/O must be done by a sequence of time-consuming
system calls through a centralized file server (such as
NFS) across the cluster network.

On  the  other  hand,  the  CDDs  work  together  to  establish 
the  SIOS  across  all  physically  distributed  disks.  Once  the 
SIOS  is  established,  all  disks  are  used  collectively  as  a 
single  global  virtual  disk  shown  in  Fig.4a. 

Each node perceives the illusion that it has several
physical disks attached locally. Figure 4b shows the
internal design of a CDD. Each CDD is essentially made
up of three working modules. The storage manager
receives and processes the I/O requests from remote client
modules. The client module redirects local I/O requests to
remote disk managers.

The  consistency  module  is  responsible  for  maintaining 
data  consistency  among  distributed  disks.  A  CDD  can  be 
configured  to  run  as  a  storage  manager  or  as  a  client,  or 
both  at  the  same  time.  There  are  three  possible  states  of 
each  disk:  (1)  a  manager  to  coordinate  use  of  local  disk 
storage  by  remote  nodes,  (2)  a  client  accessing  remote 
disks  through  remote  disk  managers,  and  (3)  both  of  the 
above  functions. 


(a)  A  global  virtual  disk  with  a  SIOS 
formed  by  cooperative  disks 


The  OSM  scheme  outperforms  the  chained  declustering 
scheme  mainly  in  parallel  write  operations.  The  RAID-x 
scheme  demonstrates  scalable  I/O  bandwidth  with  much 
reduced  latency  in  a  cluster  environment.  Both  Petal  and 
Tertiary  Disk  achieve  the  SIOS  at  the  user  level.  We 
achieved  the  SIOS  at  the  Linux  kernel  level.  Using  the 
CDDs,  the  cluster  can  be  built  serverless  and  offers  remote 
disk  access  directly  at  the  kernel  level. 

Parallel I/O is made possible on any subset of local
disks, because all distributed disks form a SIOS. No heavy
cross-space system calls are needed to perform remote file
access. A device masquerading technique is adopted here:
multiple CDDs run cooperatively to redirect I/O requests
to remote disks.

Data  consistency  problems  arise  when  multiple  cluster 
nodes  have  cached  copies  of  the  same  set  of  data  blocks. 
The  xFS  approach  and  the  Frangipani  approach  maintain 
the  data  consistency  at  the  file  system  level.  In  our  design, 
data  consistency  checking  is  maintained  at  the  disk  driver 
level. 

Our  approach  simplifies  the  design  and  implementation 
of  distributed  file  management  services.  Data  consistency 
is  maintained  by  all  CDDs  with  higher  speed  and 
efficiency  at  the  data  block  level.  We  introduced  a  special 
lock-group  table  for  developing  distributed  file 
management  services. 

Each record in this table corresponds to a group of data
blocks that have been granted to a specific CDD client
with write permissions. The write locks in each record are
granted and released atomically. This lock-group table is
replicated among the data consistency modules in the
CDDs, which guarantees that file management operations
are performed atomically.
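The all-or-nothing behavior of a lock-group record can be sketched in a few lines of Python. This is a single-node illustration under stated assumptions: the class name, method names, and client identifiers are hypothetical, and the replication of the table among the CDDs is not modeled.

```python
import threading


class LockGroupTable:
    """Hypothetical lock-group table: each record grants one CDD client
    write permission on a group of data blocks."""

    def __init__(self):
        self._owner = {}                # block number -> client holding the write lock
        self._mutex = threading.Lock()  # makes grant/release atomic on this node

    def grant(self, client, blocks):
        """Grant write locks on all the blocks, or on none of them."""
        with self._mutex:
            if any(self._owner.get(b, client) != client for b in blocks):
                return False            # some block is held by another client
            for b in blocks:
                self._owner[b] = client
            return True

    def release(self, client, blocks):
        """Release every lock in the group that the client holds."""
        with self._mutex:
            for b in blocks:
                if self._owner.get(b) == client:
                    del self._owner[b]
```

A request that conflicts on even one block leaves the table unchanged, so no client ever holds a partial group.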


5.  Benchmark  Performance  Results 


(b)  The  CDD  architecture 

Figure  4  Single  I/O  space  in  RAID-x  built 
at  Linux  kernel  level. 

The  Petal  virtual  disk  array  uses  chained  declustering, 
Tertiary  Disk  applies  the  RAID-5,  and  we  use  the  new 
RAID-x  architecture.  The  major  innovations  in  RAID-x 
architecture  lie  in  the  creation  of  the  orthogonal  striping 
and  mirroring  in  mapping  the  data  blocks  and  their  images 
on  the  distributed  disks. 


To  test  the  cooperative  operations  among  the  CDDs 
residing  on  individual  PCs,  we  use  all  16  PCs  as  I/O 
storage  servers.  We  use  the  same  hardware  platform  to 
compare  the  relative  performance  of  two  disk  array 
architectures:  RAID-1  and  RAID-x,  all  supported  by 
CDDs.  The  NFS  is  used  as  a  baseline  for  comparison 
purposes.  Presently,  Linux  kernel  version  2.2.5  supports 
the  RAID-0,  RAID-1,  and  RAID-5  configurations. 

We implemented RAID-x based on the RAID-0
implementation supported in the Linux kernel. This poses
no difficulty in mapping the data blocks onto the top half
of each disk. The mapping of the image blocks in the
RAID-x configuration is done by a special address
translation subroutine residing in each CDD. To study the
maximum I/O bandwidth of the disk array, the caches in
the storage servers are bypassed by issuing a special sync
command in the Linux kernel.

For  reads  or  writes,  the  file  size  chosen  was  10MB. 
Each  block  (stripe  unit)  in  the  disk  is  4  KB.  This  means 
that  a  10-MB  file  is  striped  uniformly  across  all  16  disks 
in  consecutive  stripe  groups.  We  have  performed  three 
benchmark  experiments. 

The  first  two  experiments  measure  the  parallel  I/O 
performance  in  terms  of  the  throughput  or  the  aggregate 
I/O  bandwidth.  The  first  experiment  tests  the  throughput 
of  RAID-x,  RAID-1  and  the  NFS  against  the  number  of 
client  requests.  The  second  test  checks  the  bandwidth 
against  the  disk  array  size  for  RAID-1  and  RAID-x. 

The  distributed  file  system  is  evaluated  in  the  third 
experiment  using  the  standard  Andrew  Benchmark  [11] 
consisting  of  a  sequence  of  basic  file  system  testing 
programs.  There  are  five  phases  in  the  Andrew 
benchmark. 

The  first  phase  recursively  creates  subdirectories.  The 
second  phase  measures  the  data  transfer  capabilities  by 
copying  files.  The  third  phase  recursively  examines  the 
status  of  directories  and  the  associated  files.  The  fourth 
phase  scans  the  contents  of  each  file.  The  final  phase 
compiles  the  files  and  links  them  together. 


5.1.  Bandwidth  Results  and  Analysis 

Figure 5 shows the performance of the RAID-x, RAID-1
and NFS architectures. The results on parallel reads are
given in Fig. 5a. In this test, each client reads a 10 MB
file from all the disks. Therefore, the test is truly
focused on the parallel I/O capability of the disk array. All
the files are set to be uncached and each client reads only
its own private file. All read operations are performed
simultaneously, with the help of an MPI_Barrier() call.

The NFS throughput is limited to 2.6 MB/s regardless
of the number of clients, because the NFS performs
sequential I/O on a central server. As the number of
requests increases, the NFS becomes the bottleneck and
its performance declines. RAID-x scales up to a bandwidth
of 9.7 MB/s for 16 clients. RAID-1 lags behind at
6.33 MB/s for 16 clients.

Fig. 5b shows the write bandwidths of the RAID-x,
RAID-1 and NFS subsystems. In this test, each client
writes a 10 MB file to the cache and issues a special
sync() call to flush the data blocks to the disks. All write
operations among the clients are also synchronized in
these experiments.

The NFS scales in performance up to 4 requests. Beyond
4 requests, the NFS bandwidth drops to a low
2.77 MB/s. For writes of a large file, RAID-x achieves
better scalability, reaching 9.02 MB/s for 16 clients. RAID-1
saturates early at 5.95 MB/s, because only half
of the disks are used for data storage.


Fig. 5  Aggregate I/O bandwidth of RAID-x, RAID-1 and NFS with increasing clients (x-axis: number of clients, 1 to 16): (a) parallel read; (b) parallel write


5.2.  Raw  I/O  Performance  of  RAID-x 

Raw I/O performance is plotted in Fig. 6 against the
disk array size. The results are shown for two RAID
architectures. Again, all caches are bypassed in the
experiments and the number of client processes is fixed at
16. The read ranking differs from the write ranking
sharply in these plots.

For parallel reads (Fig. 6a), the data size has very little
effect on the relative standings of the two RAIDs. It is
important to note that the read bandwidth of RAID-x
approaches 9.7 MB/s, about 78% of 12.5 MB/s, the limit
of a 100 Mbps Fast Ethernet. The difference is attributed
mainly to the CDD protocol and TCP/IP overheads
incurred.
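The 78% figure is a simple ratio, worked out below as a short Python check. The variable names are illustrative; the numbers are those quoted in the text.

```python
link_mbps = 100                  # Fast Ethernet line rate in megabits per second
peak_MBps = link_mbps / 8        # 8 bits per byte gives 12.5 MB/s
measured_MBps = 9.7              # RAID-x parallel read bandwidth reported in the text

efficiency = measured_MBps / peak_MBps
print(f"peak = {peak_MBps} MB/s, efficiency = {efficiency:.0%}")
```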


Fig. 6  Aggregate I/O bandwidth of RAID-x and RAID-1 with increasing disk numbers: (a) parallel read; (b) parallel write

For parallel writes, the large write bandwidths of
RAID-x and RAID-1 are 9.02 MB/s and 5.72 MB/s,
respectively. Table 3 shows the improvement factor of 16
clients over 1 client in using the 16-node Trojans cluster.
Compared with the Berkeley xFS results, our 1-client
bandwidth is quite high due to well-exploited parallelism
in 16-way striping across the disk array.

For this reason, the improvement factor is lower than
that achieved by the xFS system. Again, the RAID-x
demonstrated the highest improvement factor among the
three distributed RAID architectures and the NFS.
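The improvement factor is the ratio of the 16-client bandwidth to the 1-client bandwidth. The sketch below recomputes it for the RAID-x entries of Table 3; the dictionary layout and function name are illustrative choices, while the bandwidth values are copied from the table.

```python
# (1-client, 16-client) bandwidths in MB/s for RAID-x, copied from Table 3
raidx = {
    "Large Read": (3.36, 9.65),
    "Large Write": (3.12, 9.02),
    "Small Write": (3.22, 9.13),
}


def improvement_factor(one_client, sixteen_clients):
    """Ratio of the 16-client bandwidth to the 1-client bandwidth."""
    return sixteen_clients / one_client


factors = {op: round(improvement_factor(*bw), 2) for op, bw in raidx.items()}
print(factors)
```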


5.3.  Andrew  Benchmark  Results 

The Andrew benchmark tests the performance of a network
file system. In this experiment, the Andrew benchmark
was executed on four I/O subsystems with an
increasing number of client requests, up to 32. The
performance is indicated by the elapsed time in executing
the Andrew benchmark on the target I/O subsystem. Figure 7
shows the benchmark results for RAID-x and NFS.
Fig. 7  Elapsed time to execute the Andrew benchmark on the Trojans cluster (panel (b): RAID-x performance)

These tests demonstrate how the underlying storage
structures can affect the performance of the file system
being supported. Each local file system on the I/O nodes
mounts the “virtual” storage device provided by the CDD.
The number of I/O nodes is fixed at 16. Each client
executes only its own private copy of the Andrew benchmark.
We use the Linux ext2 local file system to keep the
operations on metadata atomic.
Table 3  Achievable I/O Bandwidth and Improvement Factor on Trojans Cluster

                 NFS                                  RAID-x
I/O operation    1 client    16 clients   Improve    1 client    16 clients   Improve
---------------  ----------  -----------  ---------  ----------  -----------  -------
Large read       2.58 MB/s   2.3 MB/s     0.89       3.36 MB/s   9.65 MB/s    2.87
Large write      2.11 MB/s   2.77 MB/s    1.31       3.12 MB/s   9.02 MB/s    2.89
Small write      2.47 MB/s   2.81 MB/s    1.34       3.22 MB/s   9.13 MB/s    2.84


Figure 7a shows the benchmark results for NFS, while
Figure 7b shows the results for RAID-x. The elapsed time
with NFS increases sharply with the number of clients,
while the RAID-x scheme sustains the same workload with
little change. For 16 clients, the elapsed times for
RAID-x and NFS are 6.8 and 33 seconds, respectively.

For 32 clients, these numbers increase to 7.41 and 75.5
seconds, respectively. In Fig. 7a, NFS shows worsening
performance, especially in reading files, scanning
directories, and copying files. The RAID-x
architecture, in contrast, does not share this weakness.


6.  Striped  and  Staggered  Checkpointing 

The parallel I/O capability of the distributed RAID-x
architecture can be applied to achieve fast checkpointing
in the cluster system. The striped checkpointing method
stores checkpoint files across the distributed RAID-x
system. To alleviate network contention, staggered
writing is combined with striped checkpointing.

Simultaneous  writing  of  multiple  processes  in 
coordinated  checkpointing  may  cause  a  network 
contention  and  I/O  bottleneck  problem  to  a  central  stable 
storage.  As  suggested  by  Vaidya  [28],  staggered  writing  of 
the  checkpoints  taken  by  different  nodes  reduces  the  above 
contentions.  The  time  lag  between  staggered 
checkpointers  can  alleviate  the  bottleneck  problem 
associated  with  the  central  stable  storage. 

The  basic  concept  of  staggered  checkpointing  allows 
only  one  process  to  store  the  checkpoint  at  a  time.  A  token 
is  passed  around  to  determine  the  timing.  When  a  node 
receives  the  token,  the  node  starts  to  store  the  checkpoint. 
After  finishing  checkpointing,  the  node  passes  the  token  to 
the  next  node. 
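The token-passing rule can be simulated with a few lines of Python. This is a minimal sketch, assuming that token transfer itself takes no time; the function name and the per-node write times are hypothetical.

```python
def staggered_checkpoint(write_times):
    """Return (start, finish) times per node when nodes checkpoint one at a time.

    write_times[i] is the hypothetical time node i needs to store its checkpoint.
    """
    schedule = []
    clock = 0.0                   # time at which the token reaches the next node
    for t in write_times:
        start = clock             # the node receives the token and starts writing
        clock = start + t         # the node finishes and passes the token on
        schedule.append((start, clock))
    return schedule


if __name__ == "__main__":
    print(staggered_checkpoint([2.0, 1.0, 3.0]))
```

Because only the token holder writes, the finish time of the last node equals the sum of all write times.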

Our work on coordinated checkpointing was inspired
by the previous works of Cao and associates [4], Chandy
and Lamport [5], and Vaidya [28]. In our scheme, several
nodes within the cluster form a striped group. Only the
nodes within the same striped group checkpoint
simultaneously, and the groups checkpoint one after
another in a staggered way.
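The grouping rule can be sketched as follows. The sketch assumes that consecutive node numbers form a striped group, which the text does not specify, and the function name is hypothetical.

```python
def striped_stagger_schedule(num_nodes, group_size):
    """Partition nodes into striped groups and return {round: [node ids]}.

    Nodes in one group checkpoint simultaneously; the groups take turns.
    """
    schedule = {}
    for node in range(num_nodes):
        # Assumption: consecutive node ids belong to the same striped group.
        schedule.setdefault(node // group_size, []).append(node)
    return schedule


if __name__ == "__main__":
    # 12 processes in groups of 4 give three staggered rounds.
    for rnd, nodes in striped_stagger_schedule(12, 4).items():
        print(f"round {rnd}: nodes {nodes}")
```

Changing group_size trades parallelism within a round against the number of staggered rounds.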

Figure 8 shows the concept of striped staggering in
coordinated checkpointing on the RAID-x disk array. The
drawing shows a 12-disk RAID-x array configured as a
2-dimensional structure, i.e. a 4 x 3 configuration. Each
stripe corresponds to the degree of parallelism (DOP) in
concurrent accesses of four disks in the 4 x 3 disk array.


Fig. 8.  Striped checkpointing with staggering on a distributed RAID-x (Processes 0 to 11 plotted against time; C: checkpointing overhead, S: synchronization overhead)


Successive stripes are accessed in a staggered manner
from different stripes on successive 4-disk groups, as
demonstrated in Fig. 3. Staggering implies pipelined
accesses of the disk array. We first proposed the idea of
striped checkpointing in [23]. There is a trade-off
between stripe parallelism and staggering depth.

For example, the layout in Fig. 8 can be reconfigured
from a 4 x 3 to a 6 x 2 configuration, if needed. A higher DOP
leads to higher aggregate disk bandwidth. A higher
staggering degree copes better with the network contention
problem. Staggered writing can reduce the
average checkpointing overhead. However, with a
blocking algorithm, staggered writing also
introduces synchronization time.

Although a blocking algorithm is simpler than a
non-blocking algorithm for achieving coordinated checkpointing
in parallel processing, it suffers from a large amount of
overhead. Every node must be blocked during the
checkpointing procedure. The basic idea is to shut down
all processes temporarily to define a consistent state. After all
the processes are blocked and all the messages are
delivered, the global checkpoints are stored. In the
staggered writing case, the blocked time increases
with the number of nodes.


7.  Overhead  and  Reliability  Analysis 

Figure 9 shows the advantage of striped staggering on a
distributed disk array, as compared with staggering in the
Vaidya scheme [28] on a centralized disk and the
conventional approach using the NFS server. These
preliminary results were measured on the small prototype
Trojans cluster.

Our striped checkpointing scheme has the lowest
overhead, especially when the checkpoint files become
very large. Through continued experiments on the
enlarged 64-disk RAID-x cluster, we will report more
experimental results on the checkpointing overhead and
rollback recovery latency.


Fig.  9  Checkpointing  overhead  of  staggered 
writing  on  distributed  RAIDs 

Table  4  summarizes  three  checkpointing  schemes  we 
have  compared  in  this  paper.  Their  advantages  and 
shortcomings  are  identified.  Suitable  applications  for  each 
checkpointing  scheme  are  also  elaborated. 

Using the OSM, each striped checkpointing file has its
mirrored image on its local disk. For each node, a transient
failure can be recovered from the mirrored image on the local
disk. A permanent failure of a disk can be recovered from
the striped checkpoint among the distributed disks.

The I/O performance in a degraded mode of OSM is
the same as the RAID-0 performance in a normal mode.
The striped checkpoint can be read in parallel from
RAID-x. The checkpointing recovery latency can be
shortened greatly.


Table 4  Summary of Three Coordinated Checkpointing Schemes

Checkpointing scheme:   Simultaneous writing to a central storage (the NFS scheme)
Advantages:             Simple, no inconsistent state
Shortcomings:           Has network and I/O contentions, NFS is a single point of failure
Suitable applications:  Small size of checkpoint, small number of nodes, low I/O operation

Checkpointing scheme:   Staggered writing to a central storage (Vaidya scheme)
Advantages:             Eliminate the network and I/O contention
Shortcomings:           Network bandwidth is wasted, NFS is a single point of failure
Suitable applications:  Small size of checkpointers, small number of nodes, low I/O operations

Checkpointing scheme:   Striped staggering checkpointing on any distributed RAID (our scheme)
Advantages:             Eliminate network and I/O contentions, low checkpoint overhead,
                        fully utilize network bandwidth, tolerate multiple failures among
                        stripe groups
Shortcomings:           Cannot tolerate more node failures within each stripe group
Suitable applications:  Large size of checkpointers, large number of nodes, low communication,
                        I/O intensive applications

According to the mirror mapping of the OSM, the
proposed RAID-x architecture can recover from any single
disk failure in each stripe group. The total number of disk
failures tolerated depends on the number of stripe groups
accessed. For the 4 x 3 configuration in Fig. 3, three disk
failures in three stripe groups can be tolerated. An in-depth
analysis of the reliability of the proposed checkpointing
RAID-x architecture is given in [23].
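The recovery condition stated above (at most one failed disk per stripe group) can be checked with a short sketch. It assumes that a disk is identified by a (node, position) pair and that disks at the same position on every node form one stripe group; the function name is hypothetical.

```python
from collections import Counter


def tolerable(failed_disks):
    """Return True if every stripe group contains at most one failed disk.

    failed_disks is an iterable of (node, position) pairs; the position
    of a disk within its node selects its stripe group.
    """
    per_group = Counter(position for _node, position in failed_disks)
    return all(count <= 1 for count in per_group.values())


if __name__ == "__main__":
    # 4 x 3 configuration: nodes 0..3, disk positions 0..2 (three stripe groups).
    print(tolerable([(0, 0), (1, 1), (3, 2)]))   # one failure in each stripe group
    print(tolerable([(0, 0), (2, 0)]))           # two failures in stripe group 0
```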


8.  Conclusions 

The development of the new RAID-x architecture was
inspired by several research projects. The xFS and the
Tertiary Disk projects at Berkeley [26], and the Petal
project at Compaq Digital [17], have all influenced our
design philosophy. The main difference between our
approach and these projects is that we use orthogonal
striping and mirroring (OSM) to preserve both parallel
disk accesses and staggered (pipelined) checkpointing of
successive stripes.

We built data consistency checking at the device driver
level. The CDDs work cooperatively to perform data
transfer and consistency checking. With the support of
CDDs, the design of a distributed file system can
focus on the concurrent file access policies and the
related performance issues. In this case, the complexity of
the distributed file system can be greatly reduced. Our
SIOS disk array separates the I/O subsystem into a
distributed file system and a set of distributed CDDs.

All  SSI  services  are  provided  by  the  CDDs  while  the 
file  system  modification  is  reduced  to  a  minimum. 
Furthermore,  some  desired  SSI  services  for  cluster 
computing  can  be  built  on  top  of  the  SIOS.  In  this  aspect, 
the  SIOS  is  a  very  powerful  middleware  infrastructure  to 
achieve  single-system  image.  Benchmark  performance 
results  show  that  our  distributed  RAID  can  achieve 
scalability,  performance,  and  availability  in  cluster 
computing. 

The  RAID-x  outperforms  the  RAID-1  in  the  Linux 
cluster  environment.  For  parallel  reads  with  16  active 
clients,  the  RAID-x  achieved  9.7  MB/s  throughput,  1.5 
and  3.7  times  higher  than  using  RAID-1  and  NFS, 
respectively.  Running  the  Andrew  benchmark,  RAID-x 
results  in  a  17%  cut  in  elapsed  time,  compared  with  that 
experienced  on  a  RAID-1.  The  achieved  throughput 
corresponds  to  78%  of  the  peak  bandwidth  deliverable  by 
the  Fast  Ethernet.  Scalable  I/O  bandwidth  makes  the 
RAID-x  especially  appealing  to  I/O-centric  cluster 
applications. 


The OSM mechanisms can be built not only on Linux
PC clusters, but also on any Unix workstation clusters.
These architectural features differ from the user-level
designs in Berkeley Tertiary Disk and Digital Petal virtual
disks. The new mechanisms support not only single I/O
space, but also distributed shared memory, checkpointing,
and distributed file management at the kernel level without
using cross-space system calls.

The  prototype  RAID-x  has  the  following  open  issues 
yet  to  be  solved  in  future  R/D  efforts.  These  extended 
works  are  among  the  tasks  planned  in  the  next  phase  of 
our  Trojans  cluster  project. 

(1) We expect even higher performance as we continue
improving the CDD protocol. The current handshaking
protocol could be improved with prefetching techniques.
The TCP/IP used in our prototype is known for its high
overhead. A plan is underway to port the whole cluster
system to a low-latency protocol, which is expected to
further reduce the communication overhead.

(2) We plan to design a distributed file system with I/O
load balancing capabilities, along with an enlarged
distributed disk array, on our Trojans cluster in the
future. In addition to the RAID-1, RAID-5, and
RAID-x configurations, we will also consider other
configurations, such as RAID-10 and chained
declustering.

(3) Our PC nodes in the Trojans cluster act as clients
as well as storage servers at the same time. These dual
roles affect the performance of the I/O nodes. We believe
that the I/O performance can be further improved with an
enlarged cluster size.

(4) We plan to develop a suite of middleware with
striped staggering checkpointing to support process
migration. Based on the future Trojans cluster configuration,
a more detailed analysis of the DOP and depth of staggering
will be conducted.

(5) New message logging algorithms for non-blocking
striped checkpointing will be developed to further reduce
checkpointing overhead. We also plan to
design an application-dependent checkpointing scheme to
elaborate on the efficiency of striped checkpointing.

Lots  of  interesting  research  work  can  be  generated  out 
of  a  very  large  disk  array  in  real-life  applications.  Potential 
applications  are  encouraged  in  biological  sequence 
analysis,  collaborative  engineering  design,  clusters  or 
grids  for  E-commerce,  specialized  digital  libraries,  and 
distributed  multimedia  processing. 
Abstract

A distributed heterogeneous computing (HC) system
consists of diversely capable machines harnessed together
to execute a set of tasks that vary in their computational
requirements. Heuristics are needed to map (match and
schedule) tasks onto machines in an HC system so as to
optimize some figure of merit. This paper characterizes a
simulated HC environment by using the expected execution
times of the tasks that arrive in the system on the different
machines present in the system. This information is
arranged in an “expected time to compute” (ETC) matrix as a
model of the given HC system, where the entry (i, j) is the
expected execution time of task i on machine j. This model is
needed to simulate different HC environments to allow
testing of relative performance of different mapping heuristics
under different circumstances. In particular, the ETC model
is used to express the heterogeneity among the runtimes
of the tasks to be executed, and among the machines in the
HC system. An existing range-based technique to generate
ETC matrices is described. A coefficient-of-variation-based
technique to generate ETC matrices is proposed, and
compared with the range-based technique. The
coefficient-of-variation-based ETC generation method provides
a greater control over the spread of values (i.e., heterogeneity)
in any given row or column of the ETC matrix than the
range-based method.

1.  Introduction 

A distributed heterogeneous computing (HC) system
consists of diversely capable machines harnessed together
to execute a set of tasks that vary in their computational
requirements. Heuristics are needed to map (match and
schedule) tasks onto machines in an HC system so as to
optimize some figure of merit. The heuristics that match a task
to a machine can vary in the information they use. For
example, the current candidate task can be assigned to the
machine that becomes available soonest (even if the task may
take a much longer time to execute on that machine than
elsewhere). In another approach, the task may be assigned
to the machine where it executes fastest (but ignores when
that machine becomes available). Or the current candidate
task may be assigned to the machine that completes the task
soonest, i.e., the machine which minimizes the sum of task
execution time and the machine ready time, where machine
ready time for a particular machine is the time when that
machine becomes available after having executed the tasks
previously assigned to it (e.g., [13]).

[Footnote, beginning missing:] ... under the NPS subcontract numbers N62271-98-M-0217 and N62271-98-M-0448,
and under the GSA subcontract number GS09K99BH0250. Some
of the equipment used was donated by Intel.
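The three matching strategies described above differ only in the quantity they minimize, which a short Python sketch makes explicit. The function names and the sample numbers are illustrative assumptions; etc_row stands for the expected execution times of one task on each machine, and ready for the machine ready times.

```python
def earliest_available(etc_row, ready):
    """Pick the machine that becomes available soonest (ignores execution time)."""
    return min(range(len(ready)), key=lambda j: ready[j])


def fastest_execution(etc_row, ready):
    """Pick the machine on which the task executes fastest (ignores ready time)."""
    return min(range(len(etc_row)), key=lambda j: etc_row[j])


def earliest_completion(etc_row, ready):
    """Pick the machine minimizing execution time plus machine ready time."""
    return min(range(len(etc_row)), key=lambda j: etc_row[j] + ready[j])


if __name__ == "__main__":
    etc_row = [10, 4, 7]   # made-up execution times of one task on machines 0..2
    ready = [0, 9, 2]      # made-up machine ready times
    print(earliest_available(etc_row, ready),
          fastest_execution(etc_row, ready),
          earliest_completion(etc_row, ready))
```

In this made-up case the three strategies pick three different machines, which shows why the choice of information matters.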

The discussion above should reveal that more sophisticated
(and possibly wiser) approaches to the mapping problem
require estimates of the execution times of all tasks (that
can be expected to arrive for service) on all the machines
present in the HC suite to make better mapping decisions.
One aspect of the research on HC mapping heuristics
explores the behavior of the heuristics in different HC
environments. The ability to test the relative performance of
different mapping heuristics under different circumstances
necessitates that there be a framework for generating
simulated execution times of all the tasks in the HC system on all
the machines in the HC system. Such a framework would,
in turn, require a quantification of heterogeneity to express
the variability among the runtimes of the tasks to be executed,
and among the capabilities of the machines in the HC
system. The goal of this paper is to present a methodology
for synthesizing simulated HC environments with quantifiable
levels of task and machine heterogeneity. This paper
characterizes the HC environments so that it will be easier
for the researchers to describe the workload and the
machines used in their simulations using a common scale.

Given a set of heuristics and a characterization of HC environments,
one can determine the best heuristic to use in a given environment
for optimizing a given objective function. In addition to increasing
one’s understanding of the operation of different heuristics, this
knowledge can help a working resource management system select which
mapper to use for a given real HC environment.

This  research  is  part  of  a  DARPA/ITO  Quorum  Pro¬ 
gram  project  called  MSHN  (pronounced  “mission”)  (Man¬ 
agement  System  for  Heterogeneous  Networks)  [7].  MSHN 
is  a  collaborative  research  effort  that  includes  the  Naval 
Postgraduate  School,  NOEMIX,  Purdue,  and  University  of 
Southern  California.  It  builds  on  SmartNet,  an  implemented 
scheduling  framework  and  system  for  managing  resources 
in  an  HC  environment  developed  at  NRaD  [5].  The  techni¬ 
cal  objective  of  the  MSHN  project  is  to  design,  prototype, 
and  refine  a  distributed  resource  management  system  that 
leverages  the  heterogeneity  of  resources  and  tasks  to  deliver 
the  requested  qualities  of  service.  The  methodology  devel¬ 
oped  here  for  generating  simulated  HC  environments  may 
be  used  to  design,  analyze  and  evaluate  heuristics  for  the 
Scheduling  Advisor  component  of  the  MSHN  prototype. 

The  rest  of  this  paper  is  organized  as  follows.  A  mod¬ 
el  for  describing  an  HC  system  is  presented  in  Section  2. 
Based  on  that  model,  two  techniques  for  simulating  an  HC 
environment  are  described  in  Section  3.  Section  4  briefly 
discusses  analyzing  the  task  execution  time  information 
from real-life HC scenarios. Some related work is outlined
in Section 5.

2.  Modeling  Heterogeneity 

To  better  evaluate  the  behavior  of  mapping  heuristics, 
a  model  of  the  execution  times  of  the  tasks  on  the  ma¬ 
chines  is  needed  so  that  the  parameters  of  this  model  can 
be  changed  to  investigate  the  performance  of  the  heuristics 
under  different  HC  systems  and  under  different  types  of 
tasks  to  be  mapped.  One  such  model  consists  of  an 
expected time to compute (ETC) matrix, where the entry (i,
j) is the expected execution time of task i on machine j. The
ETC  matrix  can  be  stored  on  the  same  machine  where  the 
mapper  is  stored,  and  contains  the  estimates  for  the  expect¬ 
ed  execution  times  of  a  task  on  all  machines,  for  all  the  tasks 
that  are  expected  to  arrive  for  service  over  a  given  interval 
of  time.  (Although  stored  with  the  mapper,  the  ETC  infor¬ 
mation  may  be  derived  from  other  components  of  a  resource 
management  system  (e.g.,  [7])).  In  an  ETC  matrix,  the  el¬ 
ements  along  a  row  indicate  the  estimates  of  the  expected 
execution  times  of  a  given  task  on  different  machines,  and 
those  along  a  column  give  the  estimates  of  the  expected  ex¬ 
ecution  times  of  different  tasks  on  a  given  machine. 

The  exact  actual  task  execution  times  on  all  machines 
may  not  be  known  for  all  tasks  because,  for  example,  they 
might  be  a  function  of  input  data.  What  is  typically  as¬ 
sumed  in  the  HC  literature  is  that  estimates  of  the  expected 


execution  times  of  tasks  on  all  machines  are  known  (e.g., 
[6,  10,  12,  16]).  These  estimates  could  be  built  from  task 
profiling  and  machine  benchmarking,  could  be  derived  from 
the  previous  executions  of  a  task  on  a  machine,  or  could  be 
provided  by  the  user  (e.g.,  [3,  6,  8,  14,  18]). 

The  ETC  model  presented  here  can  be  characterized  by 
three  parameters:  machine  heterogeneity,  task  heterogene¬ 
ity,  and  consistency.  The  variation  along  a  row  is  referred 
to  as  the  machine  heterogeneity;  this  is  the  degree  to  which 
the  machine  execution  times  vary  for  a  given  task  [1].  A 
system’s  machine  heterogeneity  is  based  on  a  combination 
of  the  machine  heterogeneities  for  all  tasks  (rows).  A  sys¬ 
tem  comprised  mainly  of  workstations  of  similar  capabil¬ 
ities  can  be  said  to  have  “low”  machine  heterogeneity.  A 
system  consisting  of  diversely  capable  machines,  e.g.,  a  col¬ 
lection  of  SMP’s,  workstations,  and  supercomputers,  may 
be  said  to  have  “high”  machine  heterogeneity. 

Similarly,  the  variation  along  a  column  of  an  ETC  matrix 
is  referred  to  as  the  task  heterogeneity;  this  is  the  degree  to 
which  the  task  execution  times  vary  for  a  given  machine  [1]. 
A  system’s  task  heterogeneity  is  based  on  a  combination  of 
the  task  heterogeneities  for  all  machines  (columns).  “High” 
task heterogeneity may occur when the computational needs
of the tasks vary greatly, e.g., when both time-consuming
simulations and fast compilations of small programs are
performed. “Low” task heterogeneity may typically be seen
in the jobs submitted by users solving problems of similar
complexity (jobs that hence have similar execution times on a
given machine).
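
One possible way to put numbers on these two notions is sketched below. It is an assumption made for illustration only: the text says a system's heterogeneity is "based on a combination" of the row (or column) heterogeneities without fixing the combination, so the sketch simply averages the coefficient of variation (standard deviation divided by mean) over rows and over columns. The function names and the sample matrix are hypothetical.

```python
from statistics import mean, pstdev


def cov(values):
    """Coefficient of variation: standard deviation divided by mean."""
    return pstdev(values) / mean(values)


def machine_heterogeneity(etc):
    """Average spread along rows (one task across all machines)."""
    return mean(cov(row) for row in etc)


def task_heterogeneity(etc):
    """Average spread along columns (all tasks on one machine)."""
    columns = list(zip(*etc))
    return mean(cov(col) for col in columns)


# Hypothetical ETC matrix: rows are tasks, columns are machines.
# Tasks differ by orders of magnitude; machines differ by about 10%.
etc = [
    [10.0, 11.0, 9.0],
    [100.0, 110.0, 90.0],
    [1000.0, 1100.0, 900.0],
]

print(machine_heterogeneity(etc), task_heterogeneity(etc))
```

For this matrix the column-wise figure is far larger than the row-wise one, matching the "high" task heterogeneity and "low" machine heterogeneity case.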

Based  on  the  above  idea,  four  categories  were  proposed 
for  the  ETC  matrix  in  [1]:  (a)  high  task  heterogeneity  and 
high  machine  heterogeneity,  (b)  high  task  heterogeneity  and 
low  machine  heterogeneity,  (c)  low  task  heterogeneity  and 
high  machine  heterogeneity,  and  (d)  low  task  heterogeneity 
and  low  machine  heterogeneity. 

The ETC matrix can be further classified into two categories,
consistent and inconsistent [1], which are orthogonal to the
previous classifications. For a consistent ETC matrix, if a machine
$m_x$ has a lower execution time than a machine $m_y$ for a task
$t_i$, then the same is true for any task $t_j$. A consistent ETC
matrix can be considered to represent an extreme case of low task
heterogeneity and high machine heterogeneity. If machine
heterogeneity is high enough, then the machines may differ so much
in their compute power that the differences in the computational
requirements of the tasks (if low enough) will not matter in
determining the relative order of execution times for a given task
on the different machines (i.e., along a row). As a trivially
extreme example, consider a system consisting of an Intel Pentium III
and an Intel 286. The Pentium III will almost always run any given
task from a certain set of tasks faster than the 286, provided the
computational requirements of all tasks in the set are similar (i.e.,
low task heterogeneity), thereby giving rise to a consistent
ETC matrix.
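
The consistency condition can be checked mechanically. The following is a minimal sketch, assuming the ETC matrix is stored as a list of rows (tasks) with one entry per machine; the function name is hypothetical, and treating ties as compatible with either ordering is an assumption the text does not address.

```python
from itertools import combinations


def is_consistent(etc):
    """Return True if no pair of machines swaps order between two tasks."""
    num_machines = len(etc[0])
    for x, y in combinations(range(num_machines), 2):
        x_faster_somewhere = any(row[x] < row[y] for row in etc)
        y_faster_somewhere = any(row[y] < row[x] for row in etc)
        if x_faster_somewhere and y_faster_somewhere:
            return False  # the ordering of x and y depends on the task
    return True


# Hypothetical two-machine examples (columns: fast machine, slow machine).
consistent_etc = [[1.0, 5.0], [2.0, 9.0], [3.0, 20.0]]
inconsistent_etc = [[1.0, 5.0], [9.0, 2.0]]

print(is_consistent(consistent_etc), is_consistent(inconsistent_etc))
```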

In  inconsistent  ETC  matrices,  the  relationships  among 
the  task  computational  requirements  and  machine  capabili¬ 
ties  are  such  that  no  structure  as  that  in  the  consistent  case 
is  enforced.  Inconsistent  ETC  matrices  occur  in  practice 
when:  (1)  there  is  a  variety  of  different  machine  architec¬ 
tures  in  the  HC  suite  (e.g.,  parallel  machines,  superscalars, 
workstations),  and  (2)  there  is  a  variety  of  different  com¬ 
putational  needs  among  the  tasks  (e.g.,  readily  paralleliz- 
able  tasks,  difficult  to  parallelize  tasks,  tasks  that  are  float¬ 
ing  point  intensive,  simple  text  formatting  tasks).  Thus,  the 
way  in  which  a  task’s  needs  correspond  to  a  machine’s  ca¬ 
pabilities  may  differ  for  each  possible  pairing  of  tasks  to 
machines. 

A  combination  of  these  two  cases,  which  may  be  more 
realistic  in  many  environments,  is  the  partially-consistent 
ETC  matrix,  which  is  an  inconsistent  matrix  with  a  consis¬ 
tent  sub-matrix  [2,  13].  This  sub-matrix  can  be  composed 
of  any  subset  of  rows  and  any  subset  of  columns.  As  an  ex¬ 
ample,  in  a  given  partially-consistent  ETC  matrix,  50%  of 
the  tasks  and  25%  of  the  machines  may  define  a  consistent 
sub-matrix. 

Even  though  no  structure  is  enforced  on  an  inconsistent 
ETC  matrix,  a  given  ETC  matrix  generated  to  be  inconsis¬ 
tent  may  have  the  structure  of  a  partially  consistent  ETC 
matrix.  In  this  sense,  partially-consistent  ETC  matrices  are 
a  special  case  of  inconsistent  ETC  matrices.  Similarly,  con¬ 
sistent  ETC  matrices  are  special  cases  of  inconsistent  and 
partially-consistent  ETC  matrices. 

It  should  be  noted  that  this  classification  scheme  is  used 
for generating ETC matrices. Later in this paper, it will
be shown how these three cases differ in their generation process.
If one is given an ETC matrix, and is asked to classify
it  among  these  three  classes,  it  will  be  called  a  consistent 
ETC  matrix  only  if  it  is  fully  consistent.  It  will  be  called 
inconsistent  if  it  is  not  consistent. 

Often  an  inconsistent  ETC  matrix  will  have  some  par¬ 
tial  consistency  in  it.  For  example,  a  trivial  case  of  partial- 
consistency  always  exists;  for  any  two  machines  in  the  HC 
suite,  at  least  50%  of  the  tasks  will  show  consistent  execu¬ 
tion  times. 
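
The 50% claim follows from a counting argument: for two machines x and y, every task has x no slower than y, or y no slower than x, so the larger of the two groups contains at least half of the tasks. A small sketch that checks this on a random matrix is given below; the function name and the test matrix are assumptions made for illustration.

```python
import random
from itertools import combinations


def agreeing_fraction(etc, x, y):
    """Largest fraction of tasks that order machines x and y the same way."""
    x_not_slower = sum(1 for row in etc if row[x] <= row[y])
    y_not_slower = sum(1 for row in etc if row[y] <= row[x])
    return max(x_not_slower, y_not_slower) / len(etc)


# Hypothetical inconsistent ETC matrix: 12 tasks on 5 machines.
rng = random.Random(1)
etc = [[rng.uniform(1, 100) for _ in range(5)] for _ in range(12)]

fractions = [agreeing_fraction(etc, x, y) for x, y in combinations(range(5), 2)]
print(min(fractions))
```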

3.  Generating  the  ETC  Matrices 
3.1.  Range  Based  ETC  Matrix  Generation 

Any  method  for  generating  the  ETC  matrices  will  require 
that  heterogeneity  be  defined  mathematically.  In  the  range- 
based  ETC  generation  technique,  the  heterogeneity  of  a  set 
of  execution  time  values  is  quantified  by  the  range  of  the 
execution times [2, 13]. The procedures given in this section
for  generating  the  ETC  matrices  produce  inconsistent  ETC 
matrices. It is shown later in this section how consistent and
partially-consistent ETC matrices could be obtained from
the inconsistent ETC matrices.

    (1) for $i$ from 0 to $(t - 1)$
    (2)     $\tau[i] = U(1, R_{task})$
    (3)     for $j$ from 0 to $(m - 1)$
    (4)         $e[i, j] = \tau[i] \times U(1, R_{mach})$
    (5)     endfor
    (6) endfor

Figure 1. The range-based method for generating ETC matrices.

Assume $m$ is the total number of machines in the HC suite, and $t$
is the total number of tasks expected to be serviced by the HC
system over a given interval of time. Let $U(a, b)$ be a number
sampled from a uniform distribution with a range from $a$ to $b$.
(Each invocation of $U(a, b)$ returns a new sample.) Let $R_{task}$
and $R_{mach}$ be numbers representing task heterogeneity and
machine heterogeneity, respectively, such that higher values for
$R_{task}$ and $R_{mach}$ represent higher heterogeneities. Then an
ETC matrix $e[0..(t-1), 0..(m-1)]$, for a given task heterogeneity
and a given machine heterogeneity, can be generated by the
range-based method given in Figure 1, where $e[i, j]$ is the
estimated expected execution time for the task $i$ on the machine
$j$.

As shown in Figure 1, each iteration of the outer for loop samples a
uniform distribution with a range from 1 to $R_{task}$ to generate
one value for a vector $\tau$. For each element of $\tau$ thus
generated, the $m$ iterations of the inner for loop (Line 3)
generate one row of the ETC matrix. For the $i$-th iteration of the
outer for loop, each iteration of the inner for loop produces one
element of the ETC matrix by multiplying $\tau[i]$ with a random
number sampled from a uniform distribution ranging from 1 to
$R_{mach}$.
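
The procedure of Figure 1 translates almost line for line into code. The sketch below is one possible rendering, assuming Python's standard `random` module for the uniform samples; the function name, the optional `rng` argument, and the use of nested lists are choices made here, not part of the paper.

```python
import random


def range_based_etc(t, m, r_task, r_mach, rng=None):
    """Generate a t-by-m inconsistent ETC matrix with the range-based method."""
    rng = rng or random.Random()
    etc = []
    for _ in range(t):
        # One baseline value per task, sampled from U(1, R_task).
        tau_i = rng.uniform(1, r_task)
        # Each entry of the row scales the baseline by a fresh U(1, R_mach).
        etc.append([tau_i * rng.uniform(1, r_mach) for _ in range(m)])
    return etc


# Hypothetical usage: 12 tasks, 7 machines, high task heterogeneity
# (R_task = 10^5) and low machine heterogeneity (R_mach = 10).
etc = range_based_etc(12, 7, 10**5, 10, random.Random(42))
for row in etc[:3]:
    print([round(value) for value in row])
```

Within any row the largest entry is at most $R_{mach}$ times the smallest, because all entries share the same baseline value.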

In the range-based ETC generation, it is possible to obtain high
task heterogeneity low machine heterogeneity ETC matrices with
characteristics similar to those of low task heterogeneity high
machine heterogeneity ETC matrices if $R_{task} = R_{mach}$. In
realistic HC systems, the variation that tasks show in their
computational needs is generally larger than the variation that
machines show in their capabilities. Therefore it is assumed here
that requirements of high heterogeneity tasks are likely to be more
“heterogeneous” than the capabilities of high heterogeneity machines
(i.e., $R_{task} > R_{mach}$). However, for the ETC matrices
generated here, low heterogeneity in both machines and tasks is
assumed to be the same. Table 1 shows typical values for $R_{task}$
and $R_{mach}$ for low and high heterogeneities. Tables 2 through 5
show four ETC matrices generated by the range-based method. The
execution time values in Table 2 are much higher than the execution
time values in Table 5. The difference in the values between these
two tables would be reduced if the range for the low task
heterogeneity was changed to $10^2$ to $10^4$ instead of 1 to 10.

Table 1. Suggested values for $R_{task}$ and $R_{mach}$ for a
realistic HC system for high heterogeneity and low heterogeneity.

             high      low
  task       $10^5$    $10^1$
  machine    $10^2$    $10^1$

With the range-based method, low task heterogeneity high machine
heterogeneity ETC matrices tend to have high heterogeneity for both
tasks and machines, due to the method used for generation. For
example, in Table 5, the original $\tau$ vector values were selected
from 1 to 10. When each entry is multiplied by a number from 1 to
100 for high machine heterogeneity, this generates a task
heterogeneity comparable to machine heterogeneity. It is shown later
in Section 3.2 how to produce low task heterogeneity high machine
heterogeneity ETC matrices which do show low task heterogeneity.

3.2.  Coefficient-of-Variation Based ETC Matrix
Generation

A modification of the procedure in Figure 1 defines the coefficient
of variation, $V$, of execution time values as a measure of
heterogeneity (instead of the range of execution time values). The
coefficient of variation of a set of values is a better measure of
the dispersion in the values than the standard deviation because it
expresses the standard deviation as a percentage of the mean of the
values [11]. Let $\sigma$ and $\mu$ be the standard deviation and
mean, respectively, of a set of execution time values. Then
$V = \sigma/\mu$. The coefficient-of-variation-based ETC generation
method provides greater control over the spread of the execution
time values (i.e., heterogeneity) in any given row or column of the
ETC matrix than the range-based method.

The coefficient-of-variation-based (CVB) ETC generation method
works as follows. A task vector, $q$, of expected execution times
with the desired task heterogeneity must be generated. Essentially,
$q[i]$ is the execution time of task $i$ on an “average” machine in
the HC suite. For example, if the HC suite consists of an IBM SP/2,
an Alpha server, and a Sun SPARC 5 workstation, then $q$ would
represent estimated execution times of the tasks on the Alpha server.

To generate $q$, two input parameters are needed: $\mu_{task}$ and
$V_{task}$. The input parameter $\mu_{task}$ is used to set the
average of the values in $q$. The input parameter $V_{task}$ is the
desired coefficient of variation of the values in $q$. The value of
$V_{task}$ quantifies task heterogeneity, and is larger for higher
task heterogeneity. Each element of the task vector $q$ is then used
to produce one row of the ETC matrix such that the desired
coefficient of variation of values in each row is $V_{mach}$,
another input parameter. The value of $V_{mach}$ quantifies machine
heterogeneity, and is larger for higher machine heterogeneity. Thus
$\mu_{task}$, $V_{task}$, and $V_{mach}$ are the three input
parameters for the CVB ETC generation method.

A  direct  approach  to  simulating  HC  environments  should 
use  the  probability  distribution  that  is  empirically  found  to 
represent  closely  the  distribution  of  task  execution  times. 
However,  no  standard  benchmarks  for  HC  systems  are  cur¬ 
rently  available.  Therefore,  this  research  uses  a  distribution 
which,  though  not  necessarily  reflective  of  an  actual  HC 
scenario,  is  flexible  enough  to  be  adapted  to  one.  Such  a 
distribution  should  not  produce  negative  values  of  task  ex¬ 
ecution  times  (e.g.,  ruling  out  Gaussian  distribution),  and 
should  have  a  variable  coefficient  of  variation  (e.g.,  ruling 
out  exponential  distribution). 

The  gamma  distribution  is  a  good  choice  for  the  CVB 
ETC  generation  method  because,  with  proper  constraints  on 
its  characteristic  parameters,  it  can  approximate  two  other 
probability  distributions,  namely  the  Erlang-k  and  Gaussian 
(without  the  negative  values)  [11,  15].  The  fact  that  it  can 
approximate  these  two  other  distributions  is  helpful  because 
this  increases  the  chances  that  the  simulated  ETC  matrices 
could  be  synthesized  closer  to  some  real  life  HC  environ¬ 
ment. 

The  uniform  distribution  can  also  be  used  but  is  not  as 
flexible  as  the  gamma  distribution  for  two  reasons:  (1)  it 
does  not  approximate  any  other  distribution,  and  (2)  the 
characteristic parameters of a uniform distribution cannot
take all real values (explained later in Section 3.3).

The gamma distribution [11, 15] is defined in terms of a
characteristic shape parameter, $\alpha$, and scale parameter,
$\beta$. The characteristic parameters of the gamma distribution can
be fixed to generate different distributions. For example, if
$\alpha$ is fixed to be an integer, then the gamma distribution
becomes an Erlang-k distribution. If $\alpha$ is large enough, then
the gamma distribution approaches a Gaussian distribution (but still
does not return negative values for task execution times).

Figures 2(a) and 2(b) show how a gamma density function
changes with the shape parameter $\alpha$. When the shape
parameter  increases  from  two  to  eight,  the  shape  of  the  dis¬ 
tribution  changes  from  a  curve  biased  to  the  left  to  a  more 
balanced  bell-like  curve.  Figures  2(a),  2(c)  and  2(d)  show 

Table 2. A high task heterogeneity low machine heterogeneity matrix
generated by the range-based method using $R_{task}$ and $R_{mach}$
values of Table 1.

       m1       m2       m3       m4       m5       m6       m7
t1     333304   375636   198220   190694   395173   258818   376568
t2     442658   400648   346423   181600   289558   323546   380792
t3     75696    103564   438703   129944   67881    194194   425543
t4     194421   392810   582168   248073   178060   267439   611144
t5     466164   424736   503137   325183   193326   241520   506642
t6     665071   687676   578668   919104   795367   390558   758117
t7     177445   227254   72944    139111   236971   325137   347456
t8     32584    55086    127709   51743    100393   196190   270979
t9     311589   568804   148140   583456   209847   108797   270100
t10    314271   113525   448233   201645   274328   248473   170176
t11    272632   268320   264038   140247   110338   29620    69011
t12    489327   393071   225777   71622    243056   445419   213477

Table 3. A high task heterogeneity high machine heterogeneity matrix
generated by the range-based method using $R_{task}$ and $R_{mach}$
values of Table 1.

       m1         m2         m3         m4         m5         m6         m7
t1     2425808    3478227    719442     2378978    408142     2966676    2890219
t2     2322703    2175934    228056     3456054    6717002    5122744    3660354
t3     1254234    3182830    4408801    5347545    4582239    6124228    5343661
t4     22781 1    419597     13972      297165     438317     23374      135871
t5     6477669    5619369    707470     8380933    4693277    8496507    7279100
t6     1113545    1642662    303302     244439     1280736    541067     792149
t7     2860617    161413     2814518    2102684    8218122    7493882    2945193
t8     1 744479   H 623574   1516988    5518507    2023691    3527522    1181276
t9     6274527    1022174    3303746    7318486    7274181    6957782    2145689
t10    1025604    694016     169297     193669     1009294    1117123    690846
t11    2390362    1552226    2955480    4198336    1641012    3072991    3262071
t12    1 96699    882914     63054      199175     894968     248324     297691


Table 4. A low task heterogeneity low machine heterogeneity matrix
generated by the range-based method using $R_{task}$ and $R_{mach}$
values of Table 1.

       m1    m2    m3    m4    m5    m6    m7
t1     22    21    6     16    15    24    13
t2     7     46    5     28    45    43    31
t3     64    83    45    23    58    50    38
t4     53    56    26    42    53    9     58
t5     11    12    14    7     8     3     14
t6     33    31    46    25    23    39    10
t7     24    11    17    14    25    35    4
t8     20    17    23    4     3     18    20
t9     13    28    14    7     34    6     29
t10    2     5     7     7     6     3     7
t11    16    37    23    22    23    12    44
t12    8     66    47    11    47    55    56

Table 5. A low task heterogeneity high machine heterogeneity matrix
generated by the range-based method using $R_{task}$ and $R_{mach}$
values of Table 1.

       m1     m2     m3     m4     m5     m6     m7
t1     440    762    319    532    151    652    308
t2     459    205    457    92     92     379    60
t3     499    263    92     152    75     18     128
t4     421    362    347    194    241    481    391
t5     276    636    136    355    338    324    255
t6     89     139    37     67     9      53     139

tl 

404 

H3I 

257 

208 

539 

*8 

49 

Ha 

w*tm 

■5E1 

39 

36 

t9 

59 

HU 

IkEI 

WEI 

287 

277 

*10 

7 

235 

330 

56 

78 

t\\ 

716 

dill 

wm 

299 

144 

457 

*12 

435 

6 

394 

419 

Figure 2. Gamma probability density function for (a) $\alpha = 2$,
$\beta = 8$, (b) $\alpha = 8$, $\beta = 8$, (c) $\alpha = 2$,
$\beta = 16$, and (d) $\alpha = 2$, $\beta = 32$.


    (1) $\alpha_{task} = 1/V_{task}^2$; $\alpha_{mach} = 1/V_{mach}^2$;
        $\beta_{task} = \mu_{task}/\alpha_{task}$
    (2) for $i$ from 0 to $(t - 1)$
    (3)     $q[i] = G(\alpha_{task}, \beta_{task})$
            /* $q[i]$ will be used as mean of $i$-th row of ETC matrix */
    (4)     $\beta_{mach}[i] = q[i]/\alpha_{mach}$
            /* scale parameter for $i$-th row */
    (5)     for $j$ from 0 to $(m - 1)$
    (6)         $e[i, j] = G(\alpha_{mach}, \beta_{mach}[i])$
    (7)     endfor
    (8) endfor

Figure 3. The general CVB method for generating ETC matrices.


the  effect  on  the  distribution  caused  by  an  increase  in  the 
scale  parameter  from  8  to  16  to  32.  The  two-fold  increase  in 
the  scale  parameter  does  not  change  the  shape  of  the  graph 
(the  curve  is  still  biased  to  the  left);  however  the  curve  now 
has  twice  as  large  a  domain  (i.e.,  range  on  x-axis). 

The gamma distribution’s characteristic parameters, $\alpha$ and
$\beta$, can be easily interpreted in terms of $\mu_{task}$,
$V_{task}$, and $V_{mach}$. For a gamma distribution,
$\sigma = \beta\sqrt{\alpha}$, and $\mu = \beta\alpha$, so that
$V = \sigma/\mu = 1/\sqrt{\alpha}$ (and $\alpha = 1/V^2$). Then
$\alpha_{task} = 1/V_{task}^2$ and $\alpha_{mach} = 1/V_{mach}^2$.
Further, because $\mu = \beta\alpha$, $\beta = \mu/\alpha$, and
$\beta_{task} = \mu_{task}/\alpha_{task}$. Also, for task $i$,
$\beta_{mach}[i] = q[i]/\alpha_{mach}$.
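
The mapping from a desired mean and coefficient of variation to the gamma parameters can be checked numerically. The sketch below assumes Python's standard `random.gammavariate(alpha, beta)`, which takes a shape and a scale parameter in that order; the helper name `gamma_params` and the sample values are hypothetical.

```python
import random
from statistics import mean, pstdev


def gamma_params(mu, v):
    """Return (alpha, beta) giving mean mu and coefficient of variation v."""
    alpha = 1.0 / (v * v)   # shape: alpha = 1 / V^2
    beta = mu / alpha       # scale: beta = mu / alpha
    return alpha, beta


# Hypothetical target: mean 1000 and coefficient of variation 0.5.
alpha, beta = gamma_params(mu=1000.0, v=0.5)

rng = random.Random(0)
samples = [rng.gammavariate(alpha, beta) for _ in range(50_000)]

sample_mean = mean(samples)
sample_cov = pstdev(samples) / sample_mean
print(alpha, beta, round(sample_mean, 1), round(sample_cov, 3))
```

The empirical mean and coefficient of variation should land close to the requested values, which is the property the CVB method relies on.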

Let $G(\alpha, \beta)$ be a number sampled from a gamma distribution
with the given parameters. (Each invocation of $G(\alpha, \beta)$
returns a new sample.) Figure 3 shows the general procedure for the
CVB ETC generation.

Given the three input parameters, $V_{task}$, $V_{mach}$, and
$\mu_{task}$, Line (1) of Figure 3 determines the shape parameter
$\alpha_{task}$ and scale parameter $\beta_{task}$ of the gamma
distribution that will be later sampled to build the task vector
$q$. Line (1) also calculates the shape parameter $\alpha_{mach}$ to
use later in Line (6). In the $i$-th iteration of the outer for loop
(Line 2) in Figure 3, a gamma distribution with parameters
$\alpha_{task}$ and $\beta_{task}$ is sampled to obtain $q[i]$. Then
$q[i]$ is used to determine the scale parameter $\beta_{mach}[i]$
(to be used later in Line (6)). For the $i$-th iteration of the
outer for loop (Line 2), each iteration of the inner for loop
(Line 5) produces one element of the $i$-th row of the ETC matrix by
sampling a gamma distribution with parameters $\alpha_{mach}$ and
$\beta_{mach}[i]$. One complete row of the ETC matrix is produced by
$m$ iterations of the inner for loop (Line 5). Note that while each
row in the ETC matrix has gamma distributed execution times, the
execution times in columns are not gamma distributed.
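
A direct rendering of the Figure 3 procedure is sketched below. It assumes Python's standard `random.gammavariate` as the sampler $G$; the function name `cvb_etc`, the optional `rng` argument, and the parameter values in the usage lines are illustrative assumptions rather than values from the paper.

```python
import random
from statistics import mean, pstdev


def cvb_etc(t, m, mu_task, v_task, v_mach, rng=None):
    """Generate a t-by-m inconsistent ETC matrix with the general CVB method."""
    rng = rng or random.Random()
    alpha_task = 1.0 / v_task**2
    alpha_mach = 1.0 / v_mach**2
    beta_task = mu_task / alpha_task
    etc = []
    for _ in range(t):
        # q_i is the mean of the i-th row of the ETC matrix.
        q_i = rng.gammavariate(alpha_task, beta_task)
        # Scale parameter for the i-th row.
        beta_mach_i = q_i / alpha_mach
        etc.append([rng.gammavariate(alpha_mach, beta_mach_i) for _ in range(m)])
    return etc


# Hypothetical usage: high task heterogeneity, low machine heterogeneity.
etc = cvb_etc(t=20, m=500, mu_task=1000.0, v_task=0.6, v_mach=0.1,
              rng=random.Random(7))
row_covs = [pstdev(row) / mean(row) for row in etc]
print(round(mean(row_covs), 3))
```

The printed value is the average coefficient of variation along rows, which should be close to the requested $V_{mach}$.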

The ETC generation method of Figure 3 can be used to generate high
task heterogeneity high machine heterogeneity ETC matrices, high
task heterogeneity low machine heterogeneity ETC matrices, and low
task heterogeneity low machine heterogeneity ETC matrices, but
cannot generate low task heterogeneity high machine heterogeneity
ETC matrices. To satisfy the heterogeneity quadrants of Section 2,
each column in the final low task heterogeneity high machine
heterogeneity ETC matrix should reflect the low task heterogeneity
of the “parent” task vector $q$. This condition would not
necessarily hold if rows of the ETC matrix were produced with a high
machine heterogeneity from a task vector of low heterogeneity. This
is because a given column may be formed from widely different
execution time values from different rows because of the high
machine heterogeneity. That is, any two entries in a given column
are based on different values of $q[i]$ and $\alpha_{mach}$, and may
therefore show high task heterogeneity as opposed to the intended
low task heterogeneity. In contrast, in a high task heterogeneity
low machine heterogeneity ETC matrix the low heterogeneity among the
machines for a given task (across a row) is based on the same $q[i]$
value.

    (1) $\alpha_{task} = 1/V_{task}^2$; $\alpha_{mach} = 1/V_{mach}^2$;
        $\beta_{mach} = \mu_{mach}/\alpha_{mach}$
    (2) for $j$ from 0 to $(m - 1)$
    (3)     $p[j] = G(\alpha_{mach}, \beta_{mach})$
            /* $p[j]$ will be used as mean of $j$-th column of ETC matrix */
    (4)     $\beta_{task}[j] = p[j]/\alpha_{task}$
            /* scale parameter for $j$-th column */
    (5)     for $i$ from 0 to $(t - 1)$
    (6)         $e[i, j] = G(\alpha_{task}, \beta_{task}[j])$
    (7)     endfor
    (8) endfor

Figure 4. The CVB method for generating low task heterogeneity high
machine heterogeneity ETC matrices.

One solution is to generate what is in effect a transpose of a high
task heterogeneity low machine heterogeneity matrix to produce a low
task heterogeneity high machine heterogeneity one. The transposition
can be built into the procedure as shown in Figure 4. The procedure
in Figure 4 is very similar to the one in Figure 3. The input
parameter $\mu_{task}$ is replaced with $\mu_{mach}$. Here, first a
machine vector, $p$, (with an average value of $\mu_{mach}$) is
produced. Each element of this “parent” machine vector is then used
to generate one low task heterogeneity column of the ETC matrix,
such that the high machine heterogeneity present in $p$ is reflected
in all rows. This approach for generating low task heterogeneity
high machine heterogeneity ETC matrices can also be used with the
range-based method.
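
The column-first variant of Figure 4 can be sketched in the same style. As before, `random.gammavariate` stands in for the sampler $G$, and the function name, the `rng` argument, and the usage values are assumptions made for illustration.

```python
import random
from statistics import mean, pstdev


def cvb_etc_low_task_high_mach(t, m, mu_mach, v_task, v_mach, rng=None):
    """Generate a t-by-m ETC matrix column by column (Figure 4 style)."""
    rng = rng or random.Random()
    alpha_task = 1.0 / v_task**2
    alpha_mach = 1.0 / v_mach**2
    beta_mach = mu_mach / alpha_mach
    etc = [[0.0] * m for _ in range(t)]
    for j in range(m):
        # p_j is the mean of the j-th column of the ETC matrix.
        p_j = rng.gammavariate(alpha_mach, beta_mach)
        # Scale parameter for the j-th column.
        beta_task_j = p_j / alpha_task
        for i in range(t):
            etc[i][j] = rng.gammavariate(alpha_task, beta_task_j)
    return etc


# Hypothetical usage: low task heterogeneity, high machine heterogeneity.
etc = cvb_etc_low_task_high_mach(t=500, m=10, mu_mach=1000.0,
                                 v_task=0.1, v_mach=0.6,
                                 rng=random.Random(11))
col_covs = [pstdev(col) / mean(col) for col in zip(*etc)]
row_covs = [pstdev(row) / mean(row) for row in etc]
print(round(mean(col_covs), 3), round(mean(row_covs), 3))
```

Here the spread along columns stays near the requested $V_{task}$ while the spread along rows is much larger, which is the behavior the general procedure of Figure 3 cannot guarantee for this quadrant.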
Tables 6 through 11 show some sample ETC matrices generated using
the CVB ETC generation method. Tables 6 and 7 both show high task
heterogeneity low machine heterogeneity ETC matrices. In both
tables, the spread of the execution time values in columns is higher
than that in rows. The ETC matrix in Table 7 has a higher task
heterogeneity (higher $V_{task}$) than the ETC matrix in Table 6.
This can be seen in a higher spread in the columns of the matrix in
Table 7 than that in Table 6.

Tables 8 and 9 show high task heterogeneity high machine
heterogeneity and low task heterogeneity low machine heterogeneity
ETC matrices, respectively. The execution times in Table 8 are
widely spaced along both rows and columns. The spread of execution
times in Table 9 is smaller along both columns and rows, because
both $V_{task}$ and $V_{mach}$ are smaller.

Tables 10 and 11 show low task heterogeneity high machine
heterogeneity ETC matrices. In both tables, the spread of the
execution time values in rows is higher than that in columns. The
ETC matrix in Table 11 has a higher machine heterogeneity (higher
$V_{mach}$) than the ETC matrix in Table 10. This can be seen in a
higher spread in the rows of the matrix in Table 11 than that in
Table 10.

3.3.  Uniform  Distribution  in  the  CVB  Method 

The uniform distribution could also be used for the CVB
ETC generation method. The uniform distribution's characteristic
parameters $a$ (lower bound for the range of values)
and $b$ (upper bound for the range of values) can be easily
interpreted in terms of $\mu_{task}$, $V_{task}$, and $V_{mach}$. (Recall that
$V_{task} = \sigma_{task}/\mu_{task}$ and $V_{mach} = \sigma_{mach}/\mu_{mach}$.) For a uniform
distribution, $\sigma = (b - a)/\sqrt{12}$ and $\mu = (b + a)/2$ [15], so
that

$$a + b = 2\mu \qquad (1)$$

$$a - b = -\sigma\sqrt{12} \qquad (2)$$

Adding Equations (1) and (2),

$$a = \mu - \sigma\sqrt{3} \qquad (3)$$

$$a = \mu\,(1 - (\sigma/\mu)\sqrt{3}) \qquad (4)$$

$$a = \mu\,(1 - V\sqrt{3}) \qquad (5)$$

Also,

$$b = 2\mu - a \qquad (6)$$

Equations (5) and (6) can be used to generate the task
vector $q$ from the uniform distribution with the following
parameters:

$$a_{task} = \mu_{task}\,(1 - V_{task}\sqrt{3}) \qquad (7)$$

$$b_{task} = 2\mu_{task} - a_{task} \qquad (8)$$

Once the task vector $q$ has been generated, the $i$-th row of
the ETC matrix can be generated by sampling ($m$ times) a
uniform distribution with the following parameters:

$$a_{mach} = q[i]\,(1 - V_{mach}\sqrt{3}) \qquad (9)$$

$$b_{mach} = 2q[i] - a_{mach} \qquad (10)$$

The CVB ETC generation using the uniform distribution,
however, places a restriction on the values of $V_{task}$ and
$V_{mach}$. Because both $a_{task}$ and $a_{mach}$ have to be positive, it
follows from Equations (7) and (9) that the maximum value
for $V_{mach}$ or $V_{task}$ is $1/\sqrt{3}$. Thus, for the CVB ETC
generation, the gamma distribution is better than the uniform
distribution because it does not restrict the values of task or
machine heterogeneities.
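
The restriction can be seen in a minimal Python sketch of CVB generation with the uniform distribution. Only Equations (5) through (10) come from the text; the function names (`uniform_bounds`, `cvb_uniform_etc`), the list-of-rows matrix layout, and the use of the standard `random` module are assumptions made for illustration.

```python
import math
import random


def uniform_bounds(mean, cov):
    """Bounds (a, b) of a uniform distribution with the given mean and
    coefficient of variation, following Equations (5) and (6)."""
    if cov < 0 or cov * math.sqrt(3) >= 1:
        # a = mean * (1 - cov * sqrt(3)) must stay positive.
        raise ValueError("coefficient of variation must lie in [0, 1/sqrt(3))")
    a = mean * (1 - cov * math.sqrt(3))
    b = 2 * mean - a
    return a, b


def cvb_uniform_etc(t, m, mu_task, v_task, v_mach, rng=None):
    """Generate a t x m ETC matrix (list of rows) from uniform distributions."""
    rng = rng or random.Random()
    a_task, b_task = uniform_bounds(mu_task, v_task)      # Equations (7), (8)
    q = [rng.uniform(a_task, b_task) for _ in range(t)]   # task vector
    etc = []
    for i in range(t):
        a_mach, b_mach = uniform_bounds(q[i], v_mach)     # Equations (9), (10)
        etc.append([rng.uniform(a_mach, b_mach) for _ in range(m)])
    return etc
```

A call such as `cvb_uniform_etc(10, 10, 1000.0, 0.3, 0.1)` is accepted, whereas a heterogeneity of 0.6 is rejected because it exceeds $1/\sqrt{3} \approx 0.577$.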

3.4.  Producing  Consistent  ETC  Matrices 

The procedures given in Figures 1, 3, and 4 produce
inconsistent ETC matrices. Consistent ETC matrices can
be obtained from the inconsistent ETC matrices generated
above by sorting the execution times for each task on all
machines (i.e., sorting the values within each row and
doing this for all rows independently). From the inconsistent
ETC matrices generated above, partially-consistent
matrices consisting of an $i \times k$ sub-matrix could be generated by
sorting the execution times across a random subset of $k$
machines for each task in a random subset of $i$ tasks.
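
The two sorting steps can be sketched as follows. This is an illustrative Python fragment, not the authors' code: the function names and the list-of-rows representation of the ETC matrix are assumptions.

```python
import random


def make_consistent(etc):
    """Sort each row independently, so that in every row the execution
    time on machine j is no larger than on machine j+1."""
    return [sorted(row) for row in etc]


def make_partially_consistent(etc, n_tasks, n_machines, rng=None):
    """Return a copy of etc in which a random n_tasks x n_machines
    sub-matrix has been made consistent."""
    rng = rng or random.Random()
    rows = rng.sample(range(len(etc)), n_tasks)
    cols = sorted(rng.sample(range(len(etc[0])), n_machines))
    out = [list(row) for row in etc]
    for r in rows:
        # Sort only the chosen columns of this row, left to right.
        ordered = sorted(out[r][c] for c in cols)
        for c, value in zip(cols, ordered):
            out[r][c] = value
    return out
```

Selecting all tasks and all machines in `make_partially_consistent` gives the same result as `make_consistent`.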

It should be noted from Tables 10 and 11 that the greater
the difference in machine and task heterogeneities, the higher
the degree of consistency in the inconsistent low task
heterogeneity high machine heterogeneity ETC matrices. For
example, in Table 11 all tasks show consistent execution
times on all machines except on the machines that
correspond to columns 3 and 4. As mentioned in Section 1, these
degrees and classes of mixed-machine heterogeneity can be
used to characterize many different HC environments.

4.  Analysis  and  Synthesis 

Once the actual ETC matrices from a real life scenario
are obtained, they can be analyzed to estimate the
probability distribution of the execution times, and the values
of the model parameters (i.e., $V_{task}$, $V_{mach}$, and $\mu_{task}$ (or
$\mu_{mach}$, if a low task heterogeneity high machine heterogeneity
ETC matrix is desired)) appropriate for the given real
life scenario. The above analysis could be carried out using
common statistical procedures [9]. Once a model of a
particular HC system is available, the effect of changes in the
workload (i.e., the tasks arriving for service in the system)
and the system (i.e., the machines present in the HC system)
can be studied in a controlled manner by simply changing
the parameters of the ETC model.
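
As one possible reading of this analysis step, the sketch below estimates the three parameters from a measured ETC matrix with simple moment estimates. The estimator itself is an assumption (the text refers only to "common statistical procedures"), as are the function names. It takes the matrix to follow the row-first generation order, in which row means play the role of the task vector.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)


def estimate_cvb_parameters(etc):
    """Return (mu_task, v_task, v_mach) estimated from a list-of-rows ETC matrix."""
    row_means = [statistics.mean(row) for row in etc]
    mu_task = statistics.mean(row_means)
    # Spread of the per-task averages approximates task heterogeneity.
    v_task = coefficient_of_variation(row_means)
    # Average spread within a row approximates machine heterogeneity.
    v_mach = statistics.mean([coefficient_of_variation(row) for row in etc])
    return mu_task, v_task, v_mach
```

The estimates could then be fed back into the generation procedure to synthesize further matrices with similar heterogeneity.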




Table 6. A high task heterogeneity low machine heterogeneity matrix generated by the CVB method.
$V_{task} = 0.3$, $V_{mach} = 0.1$.

       m1    m2    m3    m4    m5    m6    m7    m8    m9   m10
t1    628   633   748   558   743   684   740   692   593   554
t2    688   712   874   743   854   851   701   701   811   864
t3    965  1029  1087  1020   921   825  1238   934   928  1042
t4    891   866   912   896   776   993   875   999   919   860
t5   1844  1507  1353  1436  1677  1691  1508  1646  1789  1251
t6   1261  1157  1193  1297  1261  1251  1156  1317  1189  1306
t7    850   928   780  1017   761   900   998   838   797   824
t8   1042  1291  1169  1562  1277  1431  1236  1092  1274  1305
t9   1309  1305  1641  1225  1425  1280  1388  1268  1290  1549
t10   881   865   752   893   883   813   892   805   873   915

Table 7. A high task heterogeneity low machine heterogeneity matrix generated by the CVB method.
$V_{task} = 0.5$, $V_{mach} = 0.1$.
Each row lists execution times on machines m1 through m10 in order; "..." marks values missing from the source.

t1:   377 476 434 486 457 486 431 417 429 ...
t2:   493 370 400 420 502 472 475 440 483 576
t3:   745 646 922 650 791 878 853 791 756 ...
t4:   ... 490 469 559 488 498 ... 431 547 542
t5:   ... 599 522 ...
t6:   921 ... 939 ...
t7:   677 ... 797 728 941 ... 686 870
t8:   428 418 ... 434 ... 378 427 447 466
t9:   263 289 267 231 243 222 283 257 240 247
t10:  1182 1518 1272 1237 1349 1218 1344 1117 1122 1260
t11:  1455 1384 1694 1644 1562 1639 1776 1813 1488 1709
t12:  3255 2753 3289 3526 2391 2588 3849 3075 3664 3312



Table 8. A high task heterogeneity high machine heterogeneity matrix generated by the CVB method.
$V_{task} = 0.6$, $V_{mach} = 0.6$.

[Table body not recoverable from the source.]

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Table  9.  A  low  task  heterogeneity  low  machine  heterogeneity  matrix  generated  by  the  CVB  method. 

$V_{task} = 0.1$, $V_{mach} = 0.1$.

[Table body not recoverable from the source.]

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Table  10.  A  low  task  heterogeneity  high  machine  heterogeneity  matrix  generated  by  the  CVB  method. 

$V_{task} = 0.1$, $V_{mach} = 0.6$.

[Table body not recoverable from the source.]

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Table  11.  A  low  task  heterogeneity  high  machine  heterogeneity  matrix  generated  by  the  CVB  method. 

$V_{task} = 0.1$, $V_{mach} = 2.0$.
Each row lists execution times on machines m1 through m10 in order; "..." marks values missing from the source.

t1:   4784 326 1620 1307 3301 10 103 4449 228 40
t2:   4315 276 1291 1863 3712 11 91 5255 200 47
t3:   6278 269 1493 1181 3186 12 93 4604 235 ...
t4 through t11:  ...
t12:  4621 ... 1473 1501 3124 12 96 4091 199 44

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This  experimental  set-up  can  then  be  used  to  find  out 
which mapping heuristics are best suited for a given set of
model parameters (i.e., $V_{task}$, $V_{mach}$, and $\mu_{task}$ (or $\mu_{mach}$)).
This  information  can  be  stored  in  a  “look-up  table,”  so  as 
to  facilitate  the  choice  of  a  mapping  heuristic  given  a  set 
of  model  parameters.  The  look-up  table  can  be  part  of  the 
toolbox  in  the  mapper. 

The  ETC  model  of  Section  2  assumes  that  the  machine 
heterogeneity  is  the  same  for  all  tasks,  i.e.,  different  tasks 
show  the  same  general  variation  in  their  execution  times 
over  different  machines.  In  reality  this  may  not  be  true; 
the  variation  in  the  execution  times  of  one  task  on  all  ma¬ 
chines  may  be  very  different  from  some  other  task.  To  mod¬ 
el  the  “variation  in  machine  heterogeneity”  along  different 
rows  (i.e.,  for  different  tasks),  another  level  of  heterogeneity 
could be introduced. For example, in the CVB ETC
generation, instead of having a fixed value for $V_{mach}$ for all the
tasks, the value of $V_{mach}$ for a given task could be variable,
e.g., it could be sampled from a probability distribution.
Once  again,  the  nature  of  the  probability  distribution  and 
its  parameters  will  need  to  be  decided  empirically. 
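
A sketch of this extension is given below, under two stated assumptions. First, rows are drawn from gamma distributions parameterized by mean and coefficient of variation (shape $1/V^2$, scale mean/shape, which gives the requested mean and coefficient of variation). Second, the per-task $V_{mach}$ is drawn uniformly from a range; the text leaves that distribution open, so the uniform choice and the function names are purely illustrative.

```python
import random


def gamma_sample(mean, cov, rng):
    """Draw from a gamma distribution with the given mean and coefficient
    of variation (shape = 1/cov**2, scale = mean/shape)."""
    shape = 1.0 / cov ** 2
    return rng.gammavariate(shape, mean / shape)


def cvb_etc_variable_vmach(t, m, mu_task, v_task, v_mach_low, v_mach_high, rng=None):
    """Generate a t x m ETC matrix in which every task has its own V_mach."""
    rng = rng or random.Random()
    etc = []
    for _ in range(t):
        q_i = gamma_sample(mu_task, v_task, rng)          # mean time of this task
        v_mach_i = rng.uniform(v_mach_low, v_mach_high)   # assumed distribution
        etc.append([gamma_sample(q_i, v_mach_i, rng) for _ in range(m)])
    return etc
```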

5.  Related  Work 

To  the  best  of  the  authors’  knowledge,  there  is  currently 
no  work  presented  in  the  open  literature  that  addresses  the 
problem  of  modeling  of  execution  times  of  the  tasks  in  an 
HC  system  (except  the  already  discussed  work  [13]).  How¬ 
ever,  below  are  presented  two  tangentially  related  works. 

A detailed workload model for parallel machines has
been given in [4]. However, the model is not intended for
HC systems in that the machine heterogeneity is not
modeled. Task execution times are modeled but tasks are
assumed to be running on multiple processing nodes, unlike
the HC environment presented here where tasks run on
single machines only.

A method for generating random task graphs is given in
[17] as part of the description of the simulation environment
for HC systems. The method proposed in [17] assumes
that the computation cost of a task $t_i$, averaged over all the
machines in the system, is available as $\bar{w}_i$. The method
does provide for characterizing the differences in the
execution times of a given task on different processors in the
HC system (i.e., machine heterogeneity). The "range
percentage" ($\beta$) of computation costs on processors roughly
corresponds to the notion of machine heterogeneity as
presented here. The execution time, $e_{ij}$, of task $t_i$ on machine
$m_j$ is randomly selected from the range $\bar{w}_i \times (1 - \beta/2) \le
e_{ij} \le \bar{w}_i \times (1 + \beta/2)$. However, the method in [17] does not
provide for describing the differences in the execution times
of all the tasks on an "average" machine in the HC system.
The method in [17] does not tell how the differences in the
values of $\bar{w}_i$ for different machines will be modeled. That
is, the method in [17] does not consider task heterogeneity.


Further,  the  model  in  [17]  does  not  take  into  account  the 
consistency  of  the  task  execution  times. 
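
For comparison, the range-percentage selection described for [17] can be sketched in a few lines. The text says only that the value is "randomly selected from the range", so the uniform draw is an assumption, as are the function and parameter names.

```python
import random


def range_based_row(w_bar, beta, m, rng=None):
    """Execution times of one task on m machines, each drawn from
    [w_bar * (1 - beta/2), w_bar * (1 + beta/2)]."""
    rng = rng or random.Random()
    low = w_bar * (1 - beta / 2)
    high = w_bar * (1 + beta / 2)
    return [rng.uniform(low, high) for _ in range(m)]
```

Because `w_bar` is an input rather than a sampled quantity, the sketch makes the gap noted above visible: nothing in it models how the average costs differ from task to task.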

6.  Conclusions 

To  describe  different  kinds  of  heterogeneous  environ¬ 
ments,  an  existing  model  based  on  the  characteristics  of 
the  ETC  matrix  was  presented.  The  three  parameters  of 
this  model  (task  heterogeneity,  machine  heterogeneity, 
and  consistency)  can  be  changed  to  investigate  the  per¬ 
formance  of  mapping  heuristics  for  different  HC  systems 
and  different  sets  of  tasks.  An  existing  range-based 
method  for  quantifying  heterogeneity  was  described,  and  a 
new  coefficient-of-variation-based  method  was  proposed. 
Corresponding  procedures  for  generating  the  ETC  matrices 
representing  various  heterogeneous  environments  were 
presented.  Sample  ETC  matrices  were  provided  for  both 
ETC generation procedures. The coefficient-of-variation-based
ETC generation method provides a greater control
over  the  spread  of  values  (i.e.,  heterogeneity)  in  any  given 
row  or  column  of  the  ETC  matrix  than  the  range-based 
method.  This  characterization  of  HC  environments  will 
allow  a  researcher  to  simulate  different  HC  environments, 
and  then  evaluate  the  behavior  of  the  mapping  heuristics 
under  different  conditions  of  heterogeneity. 
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Abstract 

We present an asynchronous Quasi-Monte Carlo (QMC)
algorithm for numerical integration tailored for heterogeneous
environments. QMC techniques are better suited for
high dimensions than adaptive methods and have generally
better convergence properties than classical Monte Carlo
methods.

The algorithm focuses on the asynchronous computation
of randomized lattice (Korobov) rules. Whereas the individual
rules disallow realistic error estimates, randomization
provides a tool for giving confidence intervals for the
magnitude of the error. The algorithm generates a sequence of
stochastic families, using an increasing number of points,
for the purpose of automatic termination.

In the algorithm, each randomized rule constitutes
a single unit of work; a work assignment consists of a set
of work units. Static and dynamic load balancing strategies
are explored to keep the processors busy performing useful
work while gradually calculating higher-level families
needed to reach the desired accuracy. We present results in
the context of a performance model for parallel programs
executing in a heterogeneous environment.


1.  Introduction 

Sampling  techniques  for  numerically  solving  integration 
problems  are  well  established,  and  are  particularly  useful 
when  solving  problems  of  high  dimensions.  While  syn¬ 
chronous  parallel  implementations  of  these  techniques  ap¬ 
pear  to  be  straightforward  on  tightly  coupled  parallel  ar¬ 
chitectures,  various  factors  push  for  an  asynchronous  so¬ 
lution  in  heterogeneous,  coarse-grained,  network  of  work¬ 
stations  (NOW)  architectures.  We  focus  on  the  design  and 
implementation  of  asynchronous  sampling  techniques  on 
these  architectures,  and  the  unique  difficulties  in  doing  so. 


This  work  is  based  on  the  sequential  implementation  by 
Genz  [12]. 

The  next  section  presents  background  information;  Sec¬ 
tion  3  introduces  Quasi-Monte  Carlo  techniques,  and  Sec¬ 
tion  4  discusses  the  algorithm.  A  performance  model  is  out¬ 
lined  in  Section  5  and  test  results  are  given  in  Section  6. 

2.  Background 

The goal is to calculate an approximate answer $Q$ to the
multivariate integral $I = \int_{\mathcal{D}} f(x)\,dx$, and an error bound
$E_a$ such that $|I - Q| \le E_a \le \varepsilon = \max\{\varepsilon_a, \varepsilon_r |I|\}$, where
$\varepsilon_a$ and $\varepsilon_r$ specify absolute and relative error tolerances,
respectively. The integration domain $\mathcal{D}$ is a $d$-dimensional
hyper-rectangular region, though without loss of generality
we assume that it is the $d$-dimensional unit hypercube $H^d$.

When $d$ is small (say, $d < 10$), adaptive partitioning
methods (APM) generally work well for finding highly
accurate solutions. They continually divide $\mathcal{D}$ into smaller
subregions, evaluating each with a quadrature rule, until an
answer $Q$ of the desired accuracy $\varepsilon$ is reached. We have
done extensive work on the parallelization of adaptive
methods (see, e.g., [6]), leading to the parallel software package
ParInt 1.0 [7].

However, when $d$ is large, the dimensional effect in the
required number of evaluation points grows to an
unacceptable level [4]. Indeed, if a particular 1-dimensional problem
requires $s$ subdivisions and one considers a $d$-dimensional
problem of the same degree of difficulty in each dimension,
then one would expect to need about $s^d$ subdivisions for
the $d$-dimensional problem. Sampling techniques such as
Monte Carlo (MC) and Quasi-Monte Carlo (QMC) methods
are used for high dimensions.

A previous version of ParInt [8] allowed for multiple
integration problems to be solved in parallel by dividing the
processors into groups, with each group solving a single
integration problem by a parallel APM. In this version, the
groups could also have applied QMC methods depending
on the type of problem. This hierarchical, two-level
parallelization is well-suited for problems where sets of
high-dimensional integrals need to be calculated, such as in the
computational finance problem treated in [17]. The
hierarchical system has a natural inclination towards a
heterogeneous implementation. The ability to easily add the QMC
technique to the hierarchical version tightens our focus on
finding a heterogeneous implementation of QMC methods.

Automatic QMC techniques require a sequence of
approximations over all of $H^d$ until an adequate answer is
reached. Each rule is a mean of function values and can be
obtained from a number of partial sums. Each of these can
be calculated independently, so they form a natural starting
point for the parallelization of the task.

To  calculate  the  approximations  in  sequence  using  paral¬ 
lel  processes  and  a  synchronous  accumulation  of  the  contri¬ 
butions  from  all  processes  to  each  element  of  the  sequence 
would  require  a  synchronization  point  for  each  sequence  el¬ 
ement.  On  a  heterogeneous  network  this  would  result  in  a 
large  communication  cost.  Furthermore,  on  a  network  there 
is  often  a  significant  lag  between  the  creation  of  the  first  and 
the  last  of  the  parallel  processes,  so  that  processes  which 
started  earlier  would  have  to  wait  for  the  later  ones.  This 
suggests  an  asynchronous  approach  for  updating  the  global 
results. 


3. Quasi-Monte Carlo methods

Basic QMC methods evaluate the integrand function at
a set of points calculated deterministically, as opposed to
MC which evaluates the integrand at random points. QMC
methods are classified according to the type of point set
used. Lattice rules have been found particularly useful for
mid-range dimensions (say, $10 < d < 20$). Equidistributed
point sequences have been used effectively in higher
dimensions as well, including Richtmyer rules and low
discrepancy rules such as Sobol's $LP_\tau$ sequences.

We use the lattice (Korobov) and Richtmyer rules
from [12] for low-to-moderate and for high dimensions,
respectively. Their convergence properties are generally
better than the $O(1/\sqrt{N})$ rate of classic MC, where $N$ is the
number of points used.

Let $I = K_N + E_N$ be the $d$-variate integral of $f$ over $H^d$
with

$$K_N = \frac{1}{N}\sum_{i=1}^{N} f\left(\left\{\frac{i}{N}v\right\}\right), \qquad (1)$$

where $\{x\}$ = the fractional part of $x$ and $v$ is a
predetermined generator vector with integer coefficients. The
error for a sequence of lattice rules (1) for increasing $N =
N_0, N_1, \ldots$, satisfies

$$E_N = O\left(\frac{(\log N)^{\gamma_k}}{N^k}\right),$$

where $\gamma_k$ is known as the index of the rules [4], for $f \in
E^k$ ($k > 1$), which is the class of all functions $f$, periodic
with period 1 in each variable and for which the Fourier
coefficients $c_m$ satisfy

$$|c_m| \le C\,r_m^{-k}$$

for all $m \ne 0$, a constant $C > 0$ and $r_m =
\prod_{i=1}^{d} \max\{1, |m_i|\}$.

In view of the periodicity requirement, a periodizing
transformation is applied to the original integral.

Two drawbacks are that good lattice rules are hard to
obtain in high dimensions and that the error is hard to estimate.

Richtmyer rules satisfy $E_N = O(N^{-1})$ for $f \in
E^k$, $k > 1$. We have $I = R_N + E_N$ where

$$R_N = \frac{1}{N}\sum_{i=1}^{N} f(\{i\theta_1\}, \ldots, \{i\theta_d\})$$

and $\theta_1, \theta_2, \ldots, \theta_d$ are $d$ irrational numbers such that
$1, \theta_1, \theta_2, \ldots, \theta_d$ satisfy $\lambda_0 + \lambda_1\theta_1 + \cdots + \lambda_d\theta_d \ne 0$ for
rational $\lambda$ coefficients not all zero [4].

Genz [12] uses $\theta_i = \sqrt{\pi_i}$ where $\pi_i$ is the $i$-th prime, and
applies the Richtmyer sequence for dimensions $> 20$.

4. Algorithm


Cranley and Patterson [3] randomize (1) to obtain a
stochastic family of rules. Let

$$K_N(\beta) = \frac{1}{N}\sum_{i=1}^{N} f\left(\left\{\frac{i}{N}v + \beta\right\}\right), \qquad (2)$$

where $\beta$ is a uniformly distributed random vector. Using a
random sample set of size $q$,

$$\bar{K}_N = \frac{1}{q}\sum_{j=1}^{q} K_N(\beta_j) \qquad (3)$$

preserves the integration properties of (1) and allows
for a standard error estimation by $\sigma^2 = E_N^2 =
\frac{1}{q(q-1)}\sum_{j=1}^{q}(K_N(\beta_j) - \bar{K}_N)^2$.
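
A randomly shifted lattice rule and its standard error can be sketched as below. This is a plain sequential illustration, not the implementation of [12]: the function names are hypothetical, the generator vector is taken in the Korobov form $(1, a, a^2, \ldots) \bmod N$, and the example values $N = 987$, $a = 610$ (a two-dimensional Fibonacci lattice) are chosen here only for demonstration.

```python
import math
import random


def korobov_vector(a, n, d):
    """Generator vector (1, a, a^2, ..., a^(d-1)) mod n."""
    return [pow(a, k, n) for k in range(d)]


def lattice_rule(f, v, n, shift):
    """K_N(beta): mean of f over the n shifted lattice points {i*v/n + beta}."""
    total = 0.0
    for i in range(1, n + 1):
        point = [((i * vk) / n + bk) % 1.0 for vk, bk in zip(v, shift)]
        total += f(point)
    return total / n


def randomized_lattice_estimate(f, v, n, q, rng=None):
    """Return (mean, standard_error) over q random shifts (requires q >= 2)."""
    rng = rng or random.Random()
    d = len(v)
    samples = [lattice_rule(f, v, n, [rng.random() for _ in range(d)])
               for _ in range(q)]
    mean = sum(samples) / q
    variance = sum((s - mean) ** 2 for s in samples) / (q * (q - 1))
    return mean, math.sqrt(variance)
```

For instance, `randomized_lattice_estimate(lambda x: x[0] * x[1], korobov_vector(610, 987, 2), 987, 8)` approximates the integral of $x_1 x_2$ over the unit square, whose exact value is 0.25.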

Our algorithm must calculate $\bar{K}_N$ values for
successively larger values of $N$ until either an answer is found
to the user-specified accuracy or the function count limit
is reached. As these $\bar{K}_{N_0}, \bar{K}_{N_1}, \ldots$ values are
calculated, the overall result is calculated as the weighted sum
$Q = (\sum_i \bar{K}_{N_i}/\sigma_i^2)/(\sum_i 1/\sigma_i^2)$ (where $\bar{K}_{N_i}$ is weighted with
the inverse of its corresponding squared standard error) and,
correspondingly, $E = 1.0/(\sum_i 1/\sigma_i^2)$. The randomized
lattice rules in (3) can be written as

$$\kappa_{ij} = K_{N_i}(\beta_j), \quad j = 1, \ldots, J_i \le q, \qquad (4)$$

where a single $\kappa_{ij}$ represents a single unit of work in our
algorithm. The $\kappa_{ij}$ values are calculated by the worker
processes and applied (asynchronously) to update the overall
result $\bar{K}_{N_i}$ and error by the controller. It is thus not
necessary that $J_i = q$ in the asynchronous computation. In our
implementation, the controller can optionally also function
as a worker.

To alleviate a potential erratic generation of the $\kappa_{ij}$ table,
a  successful  distributed  algorithm  will  attempt  to  gradually 
calculate  higher-level  rules  as  needed  to  reach  the  desired 
accuracy,  while  keeping  all  processors  busy  performing  use¬ 
ful  work.  Particularly  on  a  heterogeneous  network  of  work¬ 
stations,  in  which  the  various  processors  may  be  operating 
at  different  speeds  and  which  may  have  different  commu¬ 
nication  latencies  between  processors,  the  asynchronous  al¬ 
gorithm  must  perform  dynamic  load  balancing  to  keep  the 
processors  busy. 

Our  algorithm  fits  within  the  paradigm  of  scheduling 
tasks  within  a  distributed  system  [11].  An  important  char¬ 
acteristic  of  this  problem  is  that  we  do  not  know  in  ad¬ 
vance  when  termination  will  occur,  as  it  depends  upon  ei¬ 
ther  reaching  the  user’s  accuracy  or  the  function  count  limit. 
There are no strict precedences among the tasks (a task
being the calculation of a $\kappa_{ij}$ value), but the algorithm will
generally perform minimal work when a full $\bar{K}_{N_i}$ value is
known before beginning to calculate $\bar{K}_{N_{i+1}}$.

Initially,  all  workers  independently  and  statically  assign 
themselves  an  initial  task  for  a  low-level  rule  and  report 
their  answer  to  the  controller  via  an  update  message.  After 
that,  the  controller  dynamically  assigns  tasks  to  the  workers 
via  work  assignment  messages  as  the  updates  are  received. 

A work request is for the calculation of one or more
$\kappa_{ij}$ values. Since higher level rules require more function
evaluations, a reasonable approach is to apportion the work
into pieces of similar size based on the number of function
points required to complete the work. A single work request
may therefore consist of a set of $\kappa_{ij}$ for multiple values of
$j$ and $i$. However, we have found that after several rows
have been calculated, and for the computationally intensive
problems with which we are experimenting, it is enough to
assign $\kappa_{ij}$ cells to workers one at a time, as the computation
of each cell can require a significant amount of work.

Note  that  this  method  of  task  allocation  can  be  consid¬ 
ered  a  FIFO  work  assignment  [11],  as  the  statically  defined 
sequence  of  rules  (for  a  given  q  value  in  (3))  forms  an  im¬ 
plicit  queue  of  tasks  to  be  completed,  and  the  set  of  work 
assignments  for  a  worker  forms  its  own  FIFO  queue.  Once 
a  task  has  been  assigned  to  a  worker,  it  is  not  transferred. 

The  heterogeneity  of  the  target  parallel  platform  is  han¬ 
dled  automatically,  in  that  the  faster  workers  will  finish 
work  more  often,  and  therefore  are  assigned  more  work  than 
the  slower  processors. 
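
The self-balancing effect can be illustrated with a small event-driven simulation. It is a sketch under simplifying assumptions that are not part of the algorithm description: communication cost is ignored, every task is handed out one at a time in FIFO order, and worker speeds are constant. The function name and arguments are hypothetical.

```python
import heapq


def simulate_fifo_assignment(task_costs, speeds):
    """Simulate one-at-a-time FIFO work assignment.

    task_costs: work units per task, in queue order.
    speeds: work units per unit time for each worker.
    Returns (finish_time, tasks_done_per_worker).
    """
    # Each heap entry is (time at which the worker becomes idle, worker index).
    idle = [(0.0, w) for w in range(len(speeds))]
    heapq.heapify(idle)
    done = [0] * len(speeds)
    finish = 0.0
    for cost in task_costs:
        t, w = heapq.heappop(idle)      # next worker to report back
        t += cost / speeds[w]           # it works on the task at its own speed
        done[w] += 1
        finish = max(finish, t)
        heapq.heappush(idle, (t, w))
    return finish, done
```

With twelve unit-cost tasks and two workers of speeds 1.0 and 0.5, the faster worker ends up with eight tasks and the slower one with four, matching the 2:1 speed ratio.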


5.  Performance  on  a  heterogeneous  network 

In  this  section  we  will  discuss  a  model  (of  [2])  to  assess 
algorithm  efficiency  on  a  heterogeneous  network. 

Let the network be designated by $\nu$. The total work, $W$,
is assumed constant. The work is split up over $p$ processors,
processor $i$ being responsible for the part $W_i$ of the total
work, i.e., $\sum_{i=1}^{p} W_i = W$. Since we will assume a
particular ordering on the processors of the network, we refer to
the $p$-processor portion of the network by $\nu_p$.

The speed of a processor (for the application) is
expressed as work performed per unit of time, i.e., the speed
$\sigma_i$ of processor $i$ is $\sigma_i = W_i/T_i$, where $T_i$ is the sequential
time for the execution of $W_i$ on processor $i$. Note that the
time $T_i$ is proportional to the number of units of work $W_i$.

The total speed of the network is $\sum_{i=1}^{p} \sigma_i$ for the
corresponding partitioning of the work. The relative network
speed $R_\rho(\nu_p)$ with respect to reference processor $\rho$ is then
defined as

$$R_\rho(\nu_p) = \frac{1}{\sigma_\rho}\sum_{i=1}^{p}\sigma_i,$$

and has the meaning of the number of processors equivalent
to the reference processor, which would together have the
same total speed as the network $\nu_p$. Furthermore, $R_\rho(\nu_p)$
provides a bound for the speedup which can be obtained on
the network.
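
As a worked illustration, the relative network speed can be computed from per-machine timings of a fixed reference problem, since speed is then inversely proportional to the measured time. The timings and machine counts below are the ones reported in Section 6.2 and Table 1; the function name is an assumption.

```python
def relative_network_speed(times, reference_time):
    """R_rho: sum of per-processor speeds divided by the reference speed.

    For a fixed reference problem, sigma_i / sigma_rho equals
    reference_time / times[i].
    """
    return sum(reference_time / t for t in times)


# Timings of the reference problem on each machine type (Table 1).
ULTRA_10, SPARC_20, SPARC_5 = 123.69, 228.20, 571.91

# Network of 10 Ultra 10's, 1 Sparc 20 and 5 Sparc 5's, Ultra 10 as reference.
network = [ULTRA_10] * 10 + [SPARC_20] + [SPARC_5] * 5
r_full = relative_network_speed(network, ULTRA_10)
```

Under these timings the 16-machine network is worth roughly 11.6 reference processors, which is then the bound on the achievable speedup.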

Denoting the sequential time on reference processor $\rho$
by $T = W/\sigma_\rho$, the speedup of the network with respect to the
reference processor is given by

$$S_\rho(\nu_p) = \frac{T}{T(\nu_p)} = \frac{W}{\sigma_\rho T(\nu_p)},$$

and $S_\rho(\nu_p) \le R_\rho(\nu_p)$. Hence, the efficiency of the network,


Note  that  it  is  customary  to  use  the  fastest  processor  as 
the  reference  processor  in  defining  the  speedup  [16,  18]. 
Furthermore,  the  above  model  is  invalid  when  there  are  dif¬ 
ferent  types  of  work  (which  may  take  different  times  per 
unit  of  work  on  different  processors).  The  study  in  [16] 
shows  how  this  contributes  to  allowing  superlinear  speedup, 
for  their  decomposition  of  a  global  climate  model  on  a  het¬ 
erogeneous,  distributed  system. 

We remark that Colombet [2] extends his model by
splitting up the work into its different types, $W = \sum_j W_j$, so
that a cost (time) $\tau_{ij}$ is incurred per unit work of type $j$ on
processor $i$. Denoting the number of units of type $j$ work on
processor $i$ by $w_{ij}$, the parallel time can be minimized by
solving for $\min_w T(\nu_p) = \min_w \max_{i=1}^{p} (\sum_j \tau_{ij} w_{ij})$.
While imposing further that $T(\nu_p) = T_i$, $1 \le i \le p$, the
above expression becomes $T_\rho = \min_w (\sum_j \tau_{\rho j} w_{\rho j})$.
Solving the optimization problem for $w_{ij} \ge 0$ and denoting
the optimal solution elements by $w_{ij}^*$, the work assigned to
processor $i$ by $W_i^* = \sum_j w_{ij}^*$ and corresponding speed
by $\sigma_i = W_i^*/T_\rho$, a bound for the speedup is again obtained
by $S_\rho(\nu_p) \le R_\rho(\nu_p)$.

6.  Experimental  results 

6.1.  Problem  class 

Preliminary results were reported in [9] using static
load balancing and Korobov rules for the calculation
of the multivariate normal distribution function
$|\Sigma|^{-1/2}(2\pi)^{-d/2}\int e^{-\frac{1}{2}x^T\Sigma^{-1}x}\,dx$, where $\Sigma$ is a $d \times d$
symmetric positive definite covariance matrix and one or more
of the integration limits may be $+\infty$ or $-\infty$. Good accuracy
and speedup were obtained on our LAN of Ultra-10
Sparcstations and on the IBM-SP machine at Argonne National
Laboratory.

In [10] we reported timing results using static load balancing on a
homogeneous network, for a class of problems based on an integral
$I(f)$ from computational finance [1, 15], where $f(x)$ has a Gaussian
weight and a factor of the form

$$g(x) = \sum_{k=1}^{d} \frac{C \bigl( (1 - w_k(x)) + w_k(x) c_k \bigr) \prod_{j=1}^{k-1} (1 - w_j(x))}{\prod_{j=0}^{k-1} (1 + i_j(x))},$$

where $i_j(x) = i_0 K_0^j e^{\sigma (x_1 + \cdots + x_j)}$,
$K_0 = e^{-\sigma^2/2}$,
$w_k(x) = K_1 + K_2 \tan^{-1}(K_3 i_k(x) + K_4)$,
$c_k = \sum_{j=0}^{d-k} (1 + i_0)^{-j}$, and the domain of integration
is $\mathbb{R}^d$. The integral represents the current value of a
security backed by mortgages of length $d$ months with fixed monthly
interest rate $i_0$.

In order to map the infinite domain to the d-dimensional unit
hypercube we performed the transformation $z_k = \Phi(x_k)$,
$k = 1, \ldots, d$, where $\Phi$ is the univariate normal distribution
function. Note that the transformation absorbs the Gaussian weight.
The resulting integral is $\int_{[0,1]^d} f(z) \, dz$ with

$$f(z) = g(\Phi^{-1}(z_1), \ldots, \Phi^{-1}(z_d)).$$

Using the constants $C = 1$, $i_0 = .007$, $\sigma = .02$ and
$K_1 = .01$, $K_2 = -.005$, $K_3 = 10$, $K_4 = .5$ yields Caflisch and
Morokoff's "nearly linear" problem.
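To make the integrand concrete, the following is a minimal Python sketch of $g(x)$ and the transformed $f(z)$. It assumes the usual form of this mortgage-backed security integrand, namely a sum over months $k = 1, \ldots, d$ of the discounted cash flow, with $i_j$, $K_0$, $w_k$ and $c_k$ as defined above. The function names and the keyword defaults (set to the "nearly linear" constants) are illustrative and are not taken from the authors' code.

```python
import math
from statistics import NormalDist


def mortgage_integrand(x, C=1.0, i0=0.007, sigma=0.02,
                       K1=0.01, K2=-0.005, K3=10.0, K4=0.5):
    """Evaluate g(x) for a d-month security, with d = len(x).

    Assumed form: g(x) = sum_k C*((1 - w_k) + w_k*c_k)
                         * prod_{j<k} (1 - w_j) / prod_{j<k} (1 + i_j).
    """
    d = len(x)
    K0 = math.exp(-sigma ** 2 / 2.0)

    # Interest rates i_0..i_d, with i_j = i0 * K0^j * exp(sigma*(x_1+...+x_j)).
    rates = [i0]
    partial_sum = 0.0
    for j in range(1, d + 1):
        partial_sum += x[j - 1]
        rates.append(i0 * K0 ** j * math.exp(sigma * partial_sum))

    total = 0.0
    discount = 1.0    # running value of 1 / prod_{j=0}^{k-1} (1 + i_j)
    remaining = 1.0   # running value of prod_{j=1}^{k-1} (1 - w_j)
    for k in range(1, d + 1):
        discount /= 1.0 + rates[k - 1]
        w_k = K1 + K2 * math.atan(K3 * rates[k] + K4)
        c_k = sum((1.0 + i0) ** (-j) for j in range(d - k + 1))
        total += C * ((1.0 - w_k) + w_k * c_k) * remaining * discount
        remaining *= 1.0 - w_k
    return total


def transformed_integrand(z, **params):
    """f(z) = g(Phi^{-1}(z_1), ..., Phi^{-1}(z_d)) on the open unit cube."""
    inverse_cdf = NormalDist().inv_cdf
    return mortgage_integrand([inverse_cdf(z_k) for z_k in z], **params)
```

With $K_1 = K_2 = 0$ and $\sigma = 0$ the prepayment fraction vanishes and the rate stays at $i_0$, so the sketch reduces to a plain sum of discounted payments, which is a convenient sanity check.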

6.2.  Experimental  environment 

We ran our tests on a network $\nu$ of Sparc workstations, using
10 Ultra Sparc 10's, 1 Sparc 20, and 5 Sparc 5's. We took an Ultra 10
as our reference machine $\rho$, and always considered orderings of
the machines in decreasing speed.


Table 1. Table of machines' relative performances

Machine          Time (s)   $\sigma_\rho/\sigma$   $\sigma/\sigma_\rho$
Ultra Sparc 10   123.69     1.00                   1.00
Sparc 20         228.20     1.84                   0.54
Sparc 5          571.91     4.62                   0.22

To derive speeds relative to $\rho$, we timed the sequential solving
of a reference problem on each type of machine (thus ensuring a
benchmark containing a similar mix of instructions as the actual
algorithm tested). The problem was the financial problem in 50
dimensions to 10000 function evaluations. Table 1 shows the results:
the time (in seconds) for each machine to solve the reference problem,
the speed $\sigma = W/T$ (using a reference workload $W$ of 1.0), the
relative speed $\sigma/\sigma_\rho$, and the inverse of the relative
speed ($\sigma_\rho/\sigma$).

The ratio $\sigma_\rho/\sigma$ for the Sparc 5 is 4.62, meaning that
an Ultra 10 has the same performance as 4.62 Sparc 5's, or, inversely,
that adding a Sparc 5 to a given set of machines adds the power of
about one-fifth of an Ultra 10 ($\sigma/\sigma_\rho = 0.22$).
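The relative speeds follow directly from the measured times, since $\sigma = W/T$ with the same workload $W$ on every machine. The short Python sketch below reproduces the two ratio columns from the three timings reported in Table 1; the function and variable names are illustrative only.

```python
# Measured times (seconds) for the reference problem, from Table 1.
REFERENCE_TIMES = {
    "Ultra Sparc 10": 123.69,
    "Sparc 20": 228.20,
    "Sparc 5": 571.91,
}


def relative_speeds(times, reference, workload=1.0):
    """Return {machine: (sigma/sigma_ref, sigma_ref/sigma)} with sigma = W/T."""
    speeds = {machine: workload / t for machine, t in times.items()}
    ref_speed = speeds[reference]
    return {
        machine: (speed / ref_speed, ref_speed / speed)
        for machine, speed in speeds.items()
    }


ratios = relative_speeds(REFERENCE_TIMES, reference="Ultra Sparc 10")
for machine, (rel, inv) in ratios.items():
    print(f"{machine:15s} sigma/sigma_ref = {rel:.2f}  sigma_ref/sigma = {inv:.2f}")
```

Because the workload cancels in both ratios, the choice $W = 1.0$ does not affect the result.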

We present results for the financial problem in 60 dimensions. The
function count limit was set to 30000 evaluations, with q (samples
per lattice rule) set to 10. The error tolerance was set low so that
termination occurred by reaching the function count limit.

6.3.  Speedup  and  efficiency  results 

The speedup $S_p(\nu_p)$ and ideal speedup $R_p(\nu_p)$ were
calculated for $1 \le p \le 16$ as per Section 5. The graph in
Figure 1 shows the speedup results. As the processors were ordered
from fastest to slowest, the reduced slope in the ideal speedup curve
represents the reduced expectations from the slower workstations.
Figure 2 shows the same results in an efficiency graph.
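As an illustration of how the ideal speedup curve flattens, the sketch below assumes that the ideal speedup is the sum of the relative speeds $\sigma_i/\sigma_\rho$ of the first $p$ machines, and that efficiency is the measured speedup divided by the ideal speedup. Both are assumptions made for this example, since the defining formulas of Section 5 are not repeated here. The machine times are those of Table 1 and the helper names are hypothetical.

```python
from itertools import accumulate


def ideal_speedups(times, reference_time):
    """Cumulative sums of relative speeds T_ref / T_i, in the given order.

    Element p-1 is the assumed ideal speedup for the first p machines.
    """
    return list(accumulate(reference_time / t for t in times))


def efficiency(measured_speedup, ideal_speedup):
    """Assumed definition: measured speedup relative to the ideal speedup."""
    return measured_speedup / ideal_speedup


# 10 Ultra Sparc 10's, 1 Sparc 20, 5 Sparc 5's, ordered fastest to slowest.
machine_times = [123.69] * 10 + [228.20] + [571.91] * 5
ideal = ideal_speedups(machine_times, reference_time=123.69)

for p in (1, 10, 11, 13, 16):
    print(f"p = {p:2d}  ideal speedup = {ideal[p - 1]:.2f}")
```

Under these assumptions the ideal speedup grows by 1 per machine up to $p = 10$ and only by about 0.2 per Sparc 5 afterwards, reaching roughly 11.6 for all 16 machines.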

From  the  efficiency  graph,  one  can  see  that  we  achieve 
an  efficiency  of  about  .70,  and  retain  that  efficiency  through 
about  13  workstations.  At  that  point,  we  are  using  10  fast 
workstations,  1  medium,  and  2  slow.  Beyond  that  point  we 
begin  to  lose  efficiency. 

We believe that this loss in efficiency is due to the slower
communication to and from the slower workstations. The relative speed
$\sigma_i/\sigma_\rho$ of workstation $i$ only takes into account the
relative computational speed. If a workstation has slower
communication hardware and software, then this workstation will work
even slower than its relative speed can account for. As the model
used for the experimental results does not take this into account, it
appears as a loss of efficiency. In these models, any form of
communication, at any speed, is taken to be a loss of efficiency, as
the comparison is always to a sequential implementation that does no
communication.

Figure 1. Speedup results (speedup vs. number of processors)

Figure 2. Efficiency results

Note the downward dip in efficiency and speedup at $p = 2$. We
determined that having the controller participate in the work
resulted in problems. With this option on, the controller would begin
to calculate a Kij value when there were no updates to receive. When
a task was completed, the controller would again look for updates.
This worked fine in the lower rules, but as the rules became more
expensive to calculate, the controller could work for extended
periods of time without checking for any updates. The result was
starvation loss (i.e., a loss in efficiency), where the workers would
be waiting idle for work, and breaking loss, where the update that
would allow the algorithm to terminate would be waiting on an
incoming message queue while the controller was busy.

The fact that the controller is not working is the cause of the
downward dip in the graphs at $p = 2$. A hybrid approach is probably
best, allowing for the controller to do smaller amounts of work at a
time, especially for small values of $p$, and allowing it to devote
more time, as necessary, to controlling the workers for larger values
of $p$.

Breaking  loss  also  forms  a  loss  of  efficiency,  regardless 
of  the  issue  of  the  controller  working.  As  higher  level  rules 
are  calculated,  each  rule  requires  more  and  more  points. 
If  execution  terminates  due  to  reaching  the  function  count 
limit,  then  the  final  function  count  will  be  generally  higher 
than  the  limit;  the  number  of  excess  function  values  will 
be  the  amount  required  to  finish  the  last  mj.  As  the  tasks 
require  more  points,  this  amount  increases. 

Also  note  that  our  use  of  a  non-dedicated  network  was 
detrimental  to  the  overall  efficiency. 

6.4.  Comparisons  with  adaptive  partitioning 

As mentioned in Section 1, adaptive partitioning techniques suffer
exponentially as the number of dimensions increases. For a given
function of moderate (say, 5 to 15) dimensions, it is interesting to
consider whether APM or QMC techniques perform better. Table 2 shows
some results comparing these techniques for the financial problem at
10 dimensions, for varying amounts of requested accuracy. The
$\varepsilon_d$ values referenced in the table correspond roughly to
the number of digits of accuracy requested. The function count limit
was set to 3 million.

The table indicates that neither method has to do much work
initially. The QMC technique is able to handle tighter accuracies
before going over the function limit. The APM blows up as soon as any
partitioning is required. Note that given the large number of points
required per cubature rule application (1265 points at 10 dimensions
for our rules from [13, 14]), 3 million function evaluations
correspond to only about 2370 rule evaluations.


Table 2. Table of number of function evaluations and time (in
seconds), vs. requested accuracy, for APM and QMC

                  APM                    QMC
$\varepsilon_d$   Time     Fcn Evals     Time      Fcn Evals
1                 0.35     1265          0.21      620
2                 0.35     1265          0.21      620
3                 0.35     1265          0.21      620
4                 0.36     1265          0.21      620
5                 -        >3M           0.21      620
6                 -        -             0.22      620
7                 -        -             2.83      8740
8                 -        -             58.66     181000
9                 -        -             133.41    413000
10                -        -             -         >3M

7.  Conclusions  and  future  work 

We focused on the asynchronous computation of a sequence of
stochastic lattice rule families using dynamic load balancing, and
presented test results on a heterogeneous network.

Further theoretical work and experimentation are needed, e.g.,
regarding the number of entries needed in a family for satisfactory
error estimation. We also need to assess the quality of the weighted
average of the sequence of results and error estimates. We are
investigating the applicability of $LP_\tau$ sequences to deal with
some classes of singular behavior [5]. Furthermore, we intend to
experiment with modifications of the scheduling strategy.

Finally, we are considering the incorporation of the asynchronous
QMC methods as a significant addition to ParInt [7].
Abstract

In this paper we present a scalable scheduling heuristic for several
common classes of multi-component applications (meta-applications).
We consider this scheduling problem in a wide-area heterogeneous
computing environment, or metasystem. The heterogeneity and scale of
the computing environment and the heterogeneity of the application
make this a challenging problem. We have studied the performance of
the heuristic in simulation and the results are encouraging.
Completion times for three common classes of meta-applications were
within 10-20% of optimal on average with a worst-case variance of
60%. The results suggest that effective scheduling of
meta-applications is possible, if sufficient application and system
resource cost information is provided.


1.0  Introduction 

Metacomputing is the seamless application of wide-area distributed
computing resources to user applications. A number of research groups
are building low-level metacomputing infrastructure [2][4]. What
distinguishes metacomputing applications from other wide-area
distributed applications is the objective of high performance. In
fact, it is the promise of performance greater than can be achieved
using single site resources that makes metacomputing most attractive
in solving complex scientific problems. The applications most
suitable for metacomputing are often very heterogeneous in structure
[6]. For instance, such applications may include remote databases or
servers, remote instruments, remote supercomputers, VR devices, and
humans-in-the-loop. In addition to hardware heterogeneity, the
underlying networks that connect the different sites may also exhibit
performance heterogeneity in both latency and bandwidth. The
challenges inherent in heterogeneous computing are well known [3].

Exploiting the performance potential in metacomputing environments
requires effective application scheduling: the selection and
allocation of resources to the application. This problem is
particularly challenging due to the heterogeneous nature of both the
resources (machines and networks) and the application itself. In this
paper, we consider the scheduling of meta-applications, that is,
applications consisting of multiple schedulable components across
multiple sites, with the goal of reduced completion time. Previous
work showed that selecting the best single site for single component
parallel applications can be done efficiently [8]. The scheduling of
multiple interacting components across multiple sites is a complex
problem, especially when the network capacity is assumed to vary
between sites. Most of the metacomputing scheduling work assumes
either that communication between components can be ignored, or that
the application will be confined to run in a single site, or that the
number of sites and components is small enough to make a brute-force
scheduling algorithm feasible. In contrast, we present a scalable
scheduling heuristic that has achieved excellent results (typically
within 10% of optimal on average) in a detailed simulation study of
three common classes of meta-applications.

2.0  Meta-Application  Model 

The underlying network contains a collection of sites connected by
wide-area networks (Figure 1). The network sites each offer an amount
of one or more resources (cycles, memory, disk, or other specialized
resources) and have a point-to-point bandwidth to each other site
($bw_{i,j}$, which may differ from pair to pair). In this paper, we
characterize the site resources and network capabilities solely in
terms of their delivered performance to applications as in [1].

Meta-applications consist of a set of distinct application components
that may communicate and interact over the course of the application.
Components may be schedulable computations, remote servers or
databases, remote instruments, humans-in-the-loop, etc. (Figure 2).
Computational components themselves may be sequential or parallel
computations. Some components, such as a remote database server, are
fixed and do not require scheduling, but may impact the scheduling of
other components. For example, the placement of component B may be
influenced by the amount of data transmitted by a database to B. If a
great deal of data is moved, then a high-speed link between B and the
database may be desired. It is also possible that other components
are fixed due to scheduling constraints, such as a requirement that a
given program component run in a particular site.

We consider meta-applications in which the inter-component
communication pattern is statically known (Figure 2) and divide
meta-applications into three categories: concurrent, parallel, and
pipeline (Figure 3). Concurrent is the classic meta-application in
which a set of components, each running in a single site, are
executing concurrently and exchanging data. An example is global
climate modelling, which often integrates several large-scale coupled
computational models [5]. A parallel application is a special case of
concurrent in which a component might benefit by distribution across
multiple sites (see footnote 2). This normally applies to large-scale
parallel applications in which a task (or subcomponent) may be
replicated an arbitrary number of times with minimal task interaction
relative to task computation. Very coarse-grain SPMD computations and
highly compute-intensive applications, such as large parallel
simulations and RSA factoring, are examples in this category.
Finally, pipeline applications consist of a number of components
connected in a chain-like fashion. Examples include certain
multi-disciplinary optimization problems in which the outputs of one
program are often fed into another [7]. In addition, applications may
be hybrids of the above classes. For example, the application may be
pipelined, but one of the components may be parallel. In this work we
consider concurrent and pipeline applications with components
confined to a single site. Hybrid and parallel applications are the
subject of future work.

Footnote 2. This is not to be confused with a component that happens
to be a parallel computation, but would not benefit by multi-site
distribution.

Figure 2: A meta-application consisting of five components: three
computations (A, B, C), a database server, and a VR server. The
presence of an arc indicates data communication.

Figure 3: Meta-application classes (panel c: pipeline).

Scheduling meta-applications to minimize application completion time
requires a cost model to evaluate the potential set of candidate
schedules. These cost functions require information about the
application. For each application component $C_i$, we assume the
following information:

comp_amt[$C_i$]: the amount of computation (in number of instructions
to be executed), for schedulable $C_i$

comm_amt[$C_i$, $C_j$]: the amount of communication (in number of
bytes transmitted) between $C_i$ and $C_j$

The following cost functions are constructed using this information
and system resource information ($S$ is a site, $N$ is the number of
components, $n$ is the number of schedulable components, and $m$ is
the number of sites):

comp[$C_i$, $S_k$] for all ($i \le n$), ($k \le m$)
comm[$C_i$, $C_j$, $S_k$, $S_l$] for all ($i, j \le N$), ($k, l \le m$)
comm_total[$C_i$, $S_k$] for all ($i \le n$), ($k \le m$)
init_time[$C_i$, $S_k$] for all ($i \le n$), ($k \le m$)

The function comp gives the computation time for component $C_i$
running in site $S_k$ given the current state of the resources $S_k$
is willing to provide to the application, assuming no other component
is scheduled there. If other components are also scheduled in $S_k$,
then the comp function for these components may be degraded to
reflect the sharing of $S_k$'s resources.

The function comm gives the communication time spent by $C_i$ in
communication with $C_j$ when $C_i$ and $C_j$ are run in $S_k$ and
$S_l$ respectively. It has the value 0 if there is no communication.
If multiple components are sharing a link, then the comm function for
these components may be degraded to reflect the sharing of the link.
The total communication cost for a component, comm_total, is a
function of the comm values associated with its links. In the
simplest case it would be the sum of these costs, but in other
situations some link communication might be overlapped. The
communication costs will vary for different component assignments due
to the heterogeneity of the underlying network. For example, some
sites might be vBNS-connected, but others might be limited to
standard Internet connections. The function init_time includes any
start-up overhead, which could include queue time for $S_k$'s
resources, time to transmit component binaries from the initiating
site if they are not already located at $S_k$, etc. For simplicity,
we will assume it is 0 and that $n = N$ in the remainder of the
paper.

Numerous research groups, including ours, are addressing the problem
of producing such cost functions using application and system
resource information [1][9]. Here, we focus on the problem of making
scheduling decisions given these cost functions for three classes of
meta-applications: pipelines in which a computation stage cannot
begin until the prior stage has completed, and concurrent
applications in which all component computations and inter-component
communication are either overlapped or sequential. The
meta-application completion time $T_{CT}$ can be defined in terms of
the component cost functions (where $i$, $j$, $k$, $l$ range over the
values above):

(1) $T_{CT}[\text{pipeline}] = \sum_i \bigl( \text{comp}[C_i, S_k] + \text{comm\_total}[C_i, C_j, S_k, S_l] \bigr)$

(2) $T_{CT}[\text{concurrent, overlapped}] = \max_i \{ \text{comp}[C_i, S_k], \ \text{comm\_total}[C_i, C_j, S_k, S_l] \}$

(3) $T_{CT}[\text{concurrent}] = \max_i \{ \text{comp}[C_i, S_k] + \text{comm\_total}[C_i, C_j, S_k, S_l] \}$


Other cost functions are possible depending on the application. For
example, pipeline computation and communication could be overlapped,
or some of the components within the application may have overlapped
computation and inter-component communication. These cases are the
subject of future work.
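The three completion-time definitions can be sketched in a few lines of Python. The sketch assumes that, for a fixed assignment of components to sites, the per-component values of comp and comm_total have already been evaluated and are passed in as two lists. The function name and the string labels for the application types are illustrative.

```python
def completion_time(app_type, comp, comm_total):
    """Completion time T_CT for one fixed component/site assignment.

    comp[i] and comm_total[i] are the computation and total communication
    times of component i under that assignment.
    """
    costs = list(zip(comp, comm_total))
    if app_type == "pipeline":
        # Eq. (1): stages run one after another, so all costs add up.
        return sum(c + m for c, m in costs)
    if app_type == "concurrent_overlapped":
        # Eq. (2): computation and communication overlap within a component.
        return max(max(c, m) for c, m in costs)
    if app_type == "concurrent":
        # Eq. (3): components run concurrently, each doing comp then comm.
        return max(c + m for c, m in costs)
    raise ValueError(f"unknown application type: {app_type}")


comp = [4.0, 2.0, 6.0]
comm_total = [1.0, 5.0, 2.0]
for app_type in ("pipeline", "concurrent_overlapped", "concurrent"):
    print(app_type, completion_time(app_type, comp, comm_total))
```

For the same costs the pipeline time is never smaller than the concurrent time, which in turn is never smaller than the overlapped concurrent time.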

2.1  Scheduling  Heuristic 

The scheduling problem is to determine an assignment of schedulable
components to site resources that minimizes $T_{CT}$. Unfortunately,
this scheduling problem is NP-complete. An exhaustive search is not
feasible since if there are $n$ schedulable components and $m$ sites,
then there are $m^n$ possible schedules if components can be
co-located and single components cannot span multiple sites. We
presume that while $n$ may be small in practice (likely less than
10), $m$ may grow as metacomputing environments achieve large-scale
deployment. In practice, we may only want to consider a subset of
sites or we may limit $m$ to include sites that contain resources
specifically requested by the application.
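To illustrate why exhaustive search scales poorly, the following Python sketch enumerates all $m^n$ assignments and keeps the cheapest one. The `cost` argument is a hypothetical callable that returns the completion time of a full assignment (a tuple giving the site index of each component); it stands in for whichever of the cost models applies.

```python
from itertools import product


def exhaustive_schedule(n_components, n_sites, cost):
    """Try every assignment of components to sites.

    Returns (best_time, best_assignment, number_of_assignments_tried).
    """
    best_time, best_assignment = None, None
    tried = 0
    for assignment in product(range(n_sites), repeat=n_components):
        tried += 1
        t = cost(assignment)
        if best_time is None or t < best_time:
            best_time, best_assignment = t, assignment
    return best_time, best_assignment, tried


# Toy cost: comp_time[i][k] is the time of component i on site k,
# summed as in a pipeline with no communication.
comp_time = [[3.0, 1.0], [2.0, 5.0], [4.0, 4.0]]


def toy_cost(assignment):
    return sum(comp_time[i][site] for i, site in enumerate(assignment))


best_time, best_assignment, tried = exhaustive_schedule(3, 2, toy_cost)
print(best_time, best_assignment, tried)
```

With 8 components and 10 sites the same loop would visit $10^8$ assignments, which is why the optimal schedule is only practical as a baseline for small instances.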


The  general  steps  of  the  scheduling  process  are  the 
following: 

1: determine candidate sites
2: for each site
   a) collect available resources from the site
   b) compute cost function estimates for each component from the site
3: run scheduling heuristic to search for best component/site
   assignment

Two  scheduling  heuristics  have  been  developed,  one 
suitable  for  compute-intensive  meta-applications  and 
the  other  for  data-intensive  meta-applications  (a  newly 
emerging  class  of  meta-applications).  In  the  former 
class,  component-site  scheduling  is  most  critical,  while 
for  the  latter,  link  scheduling  (i.e.  the  link  capacity  is 
considered  first)  becomes  important. 


TABLE 1. Simulation Parameters

name              value range/units                                            comments
num_sites         3..10                                                        this size covers today's testbeds
comp_rate         [1, 10000] MIPS                                              covers weak to powerful sites
intra_link_rate   [500, 1000] KBps - fixed value                               ~ ethernet speed
inter_link_rate   [50, 100], [100, 200], [500, 10000] KBps                     slow Internet up to vBNS-like
link_variance     1, 2, 5, 10                                                  reflects increasing network heterogeneity
affinity          1, 2, 5, 10                                                  reflects increasing component/site affinity
app_type          concurrent w/wo overlap, pipeline
num_components    3..8                                                         8 should cover most meta-apps
comp_amt          [10000, 100000], [100000, 100000],                           comp_ and comm_amt ranges cover
                  [1000000, 10000000] MInstructions                            medium-coarse grain apps
comm_amt          [1, 10000], [1, 100000], [1, 100000] Bytes
col_degrade       [0, 1] - real interval                                       0 = no degradation, 1 = linear
Figure 4: Performance of scheduling heuristic (horizontal axis:
number of components)


Below  we  present  the  compute-intensive  heuristic. 

This  heuristic  considers  components  in  decreasing 
order  of  computational  weight: 

1. order application components by decreasing comp_amt value
2. init PLACEMENT[$C_i$] = -1 for all components $C_i$
3. for each component $C_i$
4.    for each site $S_k$
5.       compute $T_{CT}$ with PLACEMENT[$C_i$] = $S_k$,
         given PLACEMENT[$C_j$], $j < i$, unchanged
         (use Eq. 1-3 to compute $T_{CT}$ based on application type)
6.       remember best $T_{CT}$ and best associated site $S_{best}$
7.    end for
8.    PLACEMENT[$C_i$] = $S_{best}$
9. end for

This is a greedy algorithm with complexity $O(mn^2)$. There are $mn$
schedules and $n$ steps to evaluate $T_{CT}$ ($n$ links per
component). We consider this heuristic to be scalable as $n$ will be
small in practice. It finds a single best candidate. The computation
of $T_{CT}$ in step 5 is based on an assignment of components up to
the current component. The comp and comm terms for unassigned
components are set to 0 in this calculation.
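The greedy steps above can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the authors' implementation: comp is given as a table of times `comp[i][k]`, the communication time between two placed components is taken to be `comm_amt[i][j] / bw[k][l]`, comm_total is the plain sum over a component's links, co-location degradation is ignored, and `None` (rather than -1) marks an unplaced component. All names are hypothetical.

```python
def partial_completion_time(app_type, placement, comp, comm_amt, bw):
    """T_CT over the components placed so far; unplaced ones contribute 0."""
    placed = [i for i, site in enumerate(placement) if site is not None]
    costs = []
    for i in placed:
        c = comp[i][placement[i]]
        m = sum(comm_amt[i][j] / bw[placement[i]][placement[j]]
                for j in placed if j != i and comm_amt[i][j] > 0)
        costs.append((c, m))
    if not costs:
        return 0.0
    if app_type == "pipeline":
        return sum(c + m for c, m in costs)
    if app_type == "concurrent_overlapped":
        return max(max(c, m) for c, m in costs)
    if app_type == "concurrent":
        return max(c + m for c, m in costs)
    raise ValueError(f"unknown application type: {app_type}")


def greedy_schedule(app_type, comp_amt, comp, comm_amt, bw):
    """Place components one at a time, heaviest computation first."""
    n_sites = len(comp[0])
    order = sorted(range(len(comp)), key=lambda i: comp_amt[i], reverse=True)
    placement = [None] * len(comp)
    for i in order:
        best_site, best_time = None, float("inf")
        for k in range(n_sites):
            placement[i] = k  # tentative placement, earlier ones unchanged
            t = partial_completion_time(app_type, placement, comp, comm_amt, bw)
            if t < best_time:
                best_site, best_time = k, t
        placement[i] = best_site
    return placement, partial_completion_time(app_type, placement, comp, comm_amt, bw)


comp_amt = [100, 50, 10]                       # instructions per component
comp = [[10.0, 20.0], [5.0, 4.0], [1.0, 1.0]]  # time of component i on site k
comm_amt = [[0, 100, 0], [100, 0, 0], [0, 0, 0]]
bw = [[100.0, 1.0], [1.0, 100.0]]              # fast intra-site, slow inter-site
print(greedy_schedule("pipeline", comp_amt, comp, comm_amt, bw))
```

In this toy instance the second component is kept on the first site even though it computes faster on the second, because the slow inter-site link would dominate the completion time.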

3.0  Initial  Results 

We have run the scheduling heuristic over a large set of simulated
metacomputing environments and meta-applications, and measured the
performance of the heuristic with respect to the optimal schedule.
The simulated metacomputing environment consists of a number of
interconnected sites that provide a particular computation and
communication rate. The communication rate between the different
sites is also specified. A link variance parameter is used to vary
the inter-site communication performance, allowing us to simulate a
truly uneven network.

Meta-applications consist of a type and a number of components. For
each component we have an amount of computation and an amount of
communication to all other components. These parameters are varied to
allow us to simulate a wide range of application granularities. We
also provide an affinity parameter which is used to bias particular
components to particular sites. From this parameter, we generate an
affinity value for each component/site pair. This allows us to
simulate environments with differing degrees of heterogeneity. This
parameter is used to adjust the comp value. Finally, when components
are collocated at particular sites, we use a parameter to degrade the
site's computing power for all hosted components. This value ranges
from no effect (e.g., perhaps the site has ample resources for all
components) to a degradation linear in the number of hosted
components. Currently, we do not degrade inter-site link performance
in the event that a link is shared since we do not yet have empirical
data for this parameter. For each schedule to be evaluated, the
simulator can easily construct comp, comm_total, and $T_{CT}$ from
these parameter values. The function comm_total is defined to be the
sum of the individual link communication costs for a component.

Realistic values are chosen for the parameter ranges (Table 1). For
example, the intra-site link rate is based on a typical ethernet LAN,
while the inter-site rate ranges from Internet WAN speeds up to
reported vBNS rates. For parameters that are specified as ranges, a
random value uniformly distributed over the range is generated for
each simulation run.
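Drawing one value per range for a simulation run can be sketched as below. The dictionary holds three of the ranges listed in Table 1; the names of the dictionary and of the sampling function are hypothetical, and the way the actual simulator groups or seeds its draws is not described in the text.

```python
import random

# Three of the parameter ranges from Table 1 (units as given there).
PARAMETER_RANGES = {
    "comp_rate": (1.0, 10000.0),        # MIPS
    "intra_link_rate": (500.0, 1000.0), # KBps
    "col_degrade": (0.0, 1.0),          # 0 = no degradation, 1 = linear
}


def sample_parameters(ranges, rng):
    """Draw one uniformly distributed value per range for a simulation run."""
    return {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}


rng = random.Random(2000)  # fixed seed so a run can be repeated
run_parameters = sample_parameters(PARAMETER_RANGES, rng)
print(run_parameters)
```

Passing an explicit `random.Random` instance keeps the draws reproducible, which helps when the heuristic and the optimal search must see the same instance.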

The results suggest that the heuristic performs very effectively over
the simulated parameter ranges for the three application classes. A
simulation study of over 800,000 distinct environment and application
instances was performed. For each run, we execute the heuristic and
search for the optimal schedule. We use two comparison metrics: how
close the heuristic is to the optimal on average and the maximum
variance from optimal. The first set of results is depicted in
Figure 4. The heuristic performs best on the application classes in
the following order: pipeline, concurrent, and concurrent-overlap.
This order reflects increasing sensitivity of the overall completion
time to the scheduling of a single component. Since the completion
time of a pipeline is a sum over all components, the heuristic can
afford to schedule a few components sub-optimally. For the other
classes, suboptimal scheduling of a single component can have a
larger impact on the completion time. However, the heuristic performs
quite well for all application classes. When the number of components
is < 5, the heuristic is within 10% of optimal on average. In
general, it is within 20% on average. The heuristic also appears to
be insensitive to the heterogeneity of the network environment;
performance is fairly flat for changes in the link_variance and
affinity parameters. The heuristic also performs well as the amount
of computation and communication varies. Pipeline applications are
insensitive to these parameters, while the other application classes
exhibit greater sensitivity. The heuristic is also sensitive to the
number of sites in the environment, but exhibits slow degradation as
the number of sites increases. The second set of results indicates
that the worst-case variance from optimal is within 60% for all
parameter ranges, and typically within 30-40% (Table 2).


Table 2. Maximum variance for heuristic. Shown for each parameter
value and each application class.

Max variance for application classes [% drift from optimal], given as
(pipeline, concurrent overlap, concurrent):

Parameter        Value: max variance
num_sites        3: (21.7, 36.5, 26.8), 4: (24.5, 41.9, 28.2), 5: (25.7, 42.7, 28.5), 6: (25.4, 45.0, 30.0),
                 7: (25.6, 43.4, 28.3), 8: (27.1, 47.5, 25.5), 9: (28.4, 53.5, 28.5), 10: (32.4, 42.6, 28.6)
num_components   3: (19.2, 19.4, 12.2), 4: (23.3, 32.9, 22.6), 5: (23.7, 40.4, 27.4), 6: (25.8, 49.1, 32.8),
                 7: (27.0, 50.7, 25.6), 8: (27.0, 57.1, 38.0)
comp_amt         r1: (25.8, 44.5, 27.9), r2: (22.8, 38.3, 28.3), r3: (27.8, 40.5, 47.1)
comm_amt         r1: (21.7, 36.5, 3.1), r2: (21.3, 35.2, 25.8), r3: (29.7, 52.6, 25.4)
link_variance    1: (25.0, 40.9, 23.4), 2: (25.5, 41.9, 27.0), 5: (23.5, 42.1, 29.9), 10: (23.1, 40.8, 32.2)
affinity         1: (14.2, 42.1, 28.2), 2: (21.0, 41.5, 27.8), 5: (29.1, 40.2, 28.0), 10: (32.7, 42.0, 28.5)
4.0  Summary

The scalability of scheduling and resource management strategies will
become an increasingly important problem as Grids are scaled up. We
presented a scalable heuristic that has performed extremely well in a
simulation study of synthetic meta-applications and metacomputing
environments. Completion times for three common classes of
meta-applications were within 10-20% of optimal on average with a
worst-case variance of 60%. The results suggest that effective
scheduling of meta-applications is possible, if sufficient
application and system resource cost information is provided to the
system. Future work includes experimental validation of the
algorithms on a live system. We are also investigating the problem of
multiple job scheduling and the interplay between multiple
meta-applications and single-resource jobs.
Abstract

Computational Grids have become an important and popular computing
platform for both scientific and commercial distributed computing
communities. However, users of such systems typically find that
achieving good application execution performance remains challenging.
Although Grid infrastructures such as Legion and Globus provide basic
resource selection functionality, work allocation functionality, and
scheduling mechanisms, applications must interpret system performance
information in terms of their own requirements in order to develop
performance-efficient schedules.

We  describe  a  new  high-performance  scheduler  that  in¬ 
corporates  dynamic  system  information,  application  re¬ 
quirements,  and  a  detailed  performance  model  in  order  to 
create performance-efficient schedules. While the scheduler is
designed to provide improved performance for a magnetohydrodynamics
simulation in the Legion Computational Grid infrastructure, the design
is generalizable to other systems and other data-parallel, iterative
codes. We
describe  the  adaptive  performance  model,  resource  selec¬ 
tion  strategies,  and  scheduling  policies  employed  by  the 
scheduler.  We  demonstrate  the  improvement  in  application 
performance  achieved  by  the  scheduler  in  dedicated  and 
shared  Legion  environments. 
1.  Introduction

Computational  Grids  [7]  are  rapidly  becoming  an  impor¬ 
tant  and  popular  computing  platform  for  both  scientific  and 
commercial  distributed  computing  communities.  Grids  in¬ 
tegrate  independently  administered  machines,  storage  sys¬ 
tems,  databases,  networks,  and  scientific  instruments  with 
the  goal  of  providing  greater  delivered  application  perfor¬ 
mance  than  can  be  obtained  from  any  single  site.  There 
are  many  critical  research  challenges  in  the  development  of 
Computational  Grids  as  an  effective  computing  platform. 
For  users,  both  performance  and  programmability  of  the  un¬ 
derlying  infrastructure  are  essential  to  the  successful  imple¬ 
mentation  of  applications  in  Grid  environments. 

The  Legion  Computational  Grid  infrastructure  [11]  pro¬ 
vides  a  sophisticated  object-oriented  programming  envi¬ 
ronment  that  promotes  application  programmability  by 
enabling  transparent  access  to  Grid  resources.  Legion 
provides  basic  resource  selection,  work  allocation,  and 
scheduling  mechanisms.  In  order  to  achieve  desired  per¬ 
formance  levels,  applications  (or  their  users)  must  inter¬ 
pret  system  performance  information  in  terms  of  require¬ 
ments  specific  to  the  target  application.  Application  Level 
Scheduling  (AppLeS)  [3]  is  an  established  methodology 
for  developing  adaptive,  distributed  programs  that  execute 
in  dynamically  changing  and  heterogeneous  execution  set¬ 
tings.  The  ultimate  goal  of  this  work  is  to  draw  upon  the 
AppLeS  and  Legion  Computational  Grid  research  efforts  to 
design  an  adaptive  application  scheduler  for  regular  itera¬ 
tive  stencil  codes  in  Legion  environments. 

We  consider  a  general  class  of  regular,  data-parallel  sten¬ 
cil  codes  which  require  repeated  applications  of  relatively 
constant-time operations. Many of these codes have the following structure:

Initialization 

Loop  over  an  n-dimensional  mesh 

Finalization 

in which the basic activity of the loop is a stencil-based computation.
In other words, the data items in the n-dimensional mesh are updated
based on the values of their nearest neighbors in the mesh. Such codes
are common in scientific computing and include parallel implementations
of matrix operations as well as routines found in packages such as
ScaLAPACK [18].
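As a minimal sketch of this three-phase structure, assume a two-dimensional mesh and a four-point averaging stencil. Both are chosen only for illustration; PMHD3D itself is a three-dimensional FORTRAN code, and all function names below are hypothetical.

```python
def initialize(rows, cols):
    """Initialization: build a mesh whose top edge is 100 and the rest 0."""
    mesh = [[0.0] * cols for _ in range(rows)]
    mesh[0] = [100.0] * cols
    return mesh


def stencil_sweep(mesh):
    """One loop iteration: each interior point becomes the mean of its
    four nearest neighbors; boundary points are left unchanged."""
    rows, cols = len(mesh), len(mesh[0])
    new = [row[:] for row in mesh]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            new[r][c] = 0.25 * (mesh[r - 1][c] + mesh[r + 1][c]
                                + mesh[r][c - 1] + mesh[r][c + 1])
    return new


def finalize(mesh):
    """Finalization: reduce the mesh to a single summary value."""
    return sum(sum(row) for row in mesh)


def run(rows, cols, iterations):
    """Initialization, loop over the mesh, finalization."""
    mesh = initialize(rows, cols)
    for _ in range(iterations):
        mesh = stencil_sweep(mesh)
    return finalize(mesh)
```

Each sweep performs the same constant-time update at every interior point, which is the property that makes the per-point cost of such codes easy to benchmark.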

In  this  paper  we  focus  on  the  development  of  an  adaptive 
strategy  for  scheduling  a  regular,  data-parallel  stencil  code 
called  PMHD3D  on  the  Legion  Grid  infrastructure.  The 
primary  contributions  of  this  paper  are: 

•  We  describe  an  adaptive  performance  model  for 
PMHD3D  and  demonstrate  its  ability  to  predict  appli¬ 
cation  performance  in  initial  experiments.  The  perfor¬ 
mance  model  represents  the  application’s  requirements 
for  computation,  communication,  overhead,  and  mem¬ 
ory,  and  could  easily  be  extended  to  serve  more  gener¬ 
ally  as  a  framework  for  regular  iterative  stencil  codes 
in  Grid  environments. 

•  We  couple  the  PMHD3D  performance  model  with  re¬ 
source  selection  strategies,  schedule  selection  policies, 
and  deployment  software  to  form  an  AppLeS  sched¬ 
uler  for  PMHD3D. 

•  In  order  to  satisfy  the  requirements  of  the  PMHD3D 
performance  model  we  implement  and  utilize  a  new 
memory sensor as part of the Network Weather Service
(NWS) [22]. The sensor collects measurements
and  produces  forecasts  of  the  amount  of  free  memory 
available  on  a  processor. 

•  We  demonstrate  the  ability  of  the  AppLeS  method¬ 
ology  to  provide  enhanced  performance  for  the 
PMHD3D  application,  using  the  Legion  software  in¬ 
frastructure  as  a  platform  for  high-performance  appli¬ 
cation  execution. 

In  the  next  section  we  discuss  the  structure  of  the  target 
application  and  the  environment  that  we  used  as  a  test-bed. 
In  Section  3,  we  discuss  the  AppLeS  we  have  designed  for 
PMHD3D  and  provide  a  generalizable  performance  model. 
Section  4  provides  experimental  results  and  demonstrates 
performance  improvements  we  achieved  via  AppLeS  using 
Legion.  In  Sections  5  and  6  we  review  related  work  and 
investigate  possible  new  directions,  respectively. 


2.  Research  Components:  AppLeS,  NWS, 
PMHD3D  and  Legion 

In  order  to  build  a  high-performance  scheduler  for 
PMHD3D  we  leveraged  application  characteristics,  dy¬ 
namic  resource  information  from  NWS,  the  AppLeS 
methodology,  and  the  Legion  system  infrastructure.  In  this 
section  we  explain  each  of  these  components  in  detail. 

2.1.  AppLeS 

The  AppLeS  project  focuses  on  the  development  of  a 
methodology  and  software  for  achieving  application  per¬ 
formance  via  adaptive  scheduling  [1].  For  individual  ap¬ 
plications,  an  AppLeS  is  an  agent  that  integrates  with  the 
application  and  uses  dynamic  and  application-specific  in¬ 
formation  to  develop  and  deploy  a  customized  adaptive  ap¬ 
plication  schedule.  For  structurally  similar  classes  of  appli¬ 
cations,  an  AppLeS  template  provides  a  “pluggable”  frame¬ 
work  which  comprises  a  class-specific  performance  model, 
scheduling  model,  and  deployment  module.  An  applica¬ 
tion  from  the  class  can  be  instantiated  within  the  template 
to  form  a  performance-oriented  self-scheduling  application 
targeted  to  the  underlying  Grid  resources. 

AppLeS  schedulers  often  rely  on  available  tools  in  order 
to  deploy  the  schedule  or  to  gather  information  on  resources 
or  environment.  AppLeS  commonly  depends  on  the  Net¬ 
work  Weather  Service  (NWS)  (see  Section  2.4)  to  provide 
dynamic  predictions  of  resource  load  and  availability.  To¬ 
gether,  AppLeS  and  the  Network  Weather  Service  can  be 
used  to  adapt  application  performance  to  the  deliverable  ca¬ 
pacities  of  Grid  resources  at  execution  time.  In  this  project 
AppLeS  uses  Legion  to  execute  a  schedule  and  the  Internet 
Backplane  Protocol  (IBP)  [13]  to  effectively  cache  the  data 
coming  from  NWS. 

2.2.  PMHD3D 

The  target  application  for  this  work,  PMHD3D  [12,  15], 
is  a  magnetohydrodynamics  simulation  developed  at  the 
University  of  Virginia  Department  of  Astronomy  by  John 
F.  Hawley  and  ported  to  Legion  by  Greg  Lindahl.  The  code
is  an  MPI  FORTRAN  stencil-based  application  and  shares 
many  characteristics  with  other  stencil  codes.  The  code  is 
structured  as  a  three-dimensional  mesh  of  data,  upon  which 
the  same  computation  is  iteratively  performed  on  each  point 
using  data  from  its  neighbors.  PMHD3D  alternates  between 
CPU-intensive  computation  and  communication  (between 
“slab”  neighbors  and  for  barrier  synchronizations). 

At startup PMHD3D reads a configuration file that specifies the problem
size and the target number of processors. Since the other two
dimensions are fixed in PMHD3D's three-dimensional mesh, we refer to
the height of the mesh as the problem size. In order to allocate work
among processors in the computation, the mesh is divided into
horizontal slabs such that each processor receives a slab. For load
balancing purposes each processor can be assigned a different amount of
work (by dividing the work into slabs of varying height). The AppLeS
scheduler determines the optimal height of each slab depending on the
raw speed of the processor and on NWS forecasts of CPU load, the amount
of free memory, and network conditions. AppLeS is dynamic in the sense
that the data used by the scheduler is computed and collected just
before execution, but once the schedule is created and implemented, the
execution currently proceeds without interaction with the AppLeS.

Figure 1. PMHD3D run-time scenarios with and without AppLeS.

2.3.  Legion 

Legion,  a  project  at  the  University  of  Virginia,  is  de¬ 
signed  to  provide  users  with  a  transparent,  secure,  and  re¬ 
liable  interface  to  resources  in  a  wide-area  system,  both  at 
the  programming  interface  level  as  well  as  at  the  end-user 
level  [9,  14].  Both  the  programmer  and  the  end-user  have 
coherent  and  seamless  access  to  all  the  resources  and  ser¬ 
vices  managed  by  Legion.  Legion  addresses  challenging 
issues  in  Computational  Grid  research  such  as  parallelism, 
fault-tolerance,  security,  autonomy,  heterogeneity,  legacy 
code  management,  resource  management,  and  access  trans¬ 
parency. 

Legion provides mechanisms and facilities, leaving to the programmer
the implementation of the policies to be enforced for a particular
task. Following this idea, scheduling in Legion is flexible and can be
tailored to suit applications with different requirements. The main
Legion components involved in scheduling are the collection, the
enactor, the scheduler, and the hosts which will execute the schedule
[5]. The collection provides information about the available resources
and the scheduler selects the resources to be used in a schedule. The
schedule is then given to the enactor, which contacts the host objects
involved in the schedule and attempts to execute the application. This
scheme provides scheduling flexibility; for example, in case of host
failures, the enactor can ask the scheduler for a new schedule and
continue despite the failure, the collection can return subsets of the
resources depending on the user and/or the application, or the hosts
can refuse to serve a specific user.
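The division of labor among these components can be illustrated with a small sketch. The class and method names below (`Host.start`, `Collection.query`, `Scheduler.make_schedule`, `Enactor.enact`) are hypothetical and are not the Legion object interfaces. The sketch only mirrors the interaction described above, in which the enactor asks the scheduler for a new schedule after a host failure.

```python
class Host:
    """Stand-in for a host object; `up=False` models a failure or refusal."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up

    def start(self, task):
        # Returns True if the host accepts and starts the task.
        return self.up


class Collection:
    """Reports the resources currently known to the system."""

    def __init__(self, hosts):
        self._hosts = list(hosts)

    def query(self):
        return list(self._hosts)


class Scheduler:
    """Trivial policy: take the first n hosts that are not excluded."""

    def make_schedule(self, hosts, n, exclude=()):
        usable = [h for h in hosts if h.name not in exclude]
        return usable[:n] if len(usable) >= n else None


class Enactor:
    """Tries to execute a schedule; on failure asks for a new one."""

    def __init__(self, collection, scheduler):
        self.collection = collection
        self.scheduler = scheduler

    def enact(self, task, n):
        failed = set()
        while True:
            schedule = self.scheduler.make_schedule(
                self.collection.query(), n, exclude=failed)
            if schedule is None:
                return None  # no feasible schedule remains
            bad = [h.name for h in schedule if not h.start(task)]
            if not bad:
                return [h.name for h in schedule]
            failed.update(bad)  # reschedule without the failed hosts
```

Because each component is a separate object, any one of them can be replaced by a different policy without changing the others, which is the flexibility the text describes.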

Legion  currently  provides  default  implementations  of  all 
the  objects  described  herein.  Moreover,  new  objects  can 
be  developed  and  used  rather  than  the  default  ones.  Note 
that  the  PMHD3D  AppLeS  is  developed  “on  top”  of  Le¬ 
gion,  and  uses  default  Legion  objects.  We  would  expect 
the  performance  improvement  for  such  a  code  to  conserva¬ 
tively  bound  from  below  that  which  would  be  achievable  if 
the  AppLeS  were  structured  as  a  Legion  object.  We  plan 
to  eventually  develop  the  AppLeS  described  here  as  a  Le¬ 
gion  scheduling  object  for  a  class  of  regular,  iterative,  data- 
parallel  applications. 

2.4.  Network  Weather  Service 

The Network Weather Service [17, 22] is a distributed system that
periodically monitors and dynamically forecasts the performance that
various network and computational resources can deliver. NWS is
composed of sensors, memories, and forecasters. Sensors measure the
availability of a resource, for example CPU availability, and then
record the measurement in an NWS memory. In response to a query, the
NWS software will return a time series of measurements from any
activated sensor in the system. This time series can then be passed to
the NWS forecaster, which predicts the future availability of the
resource. The forecaster tests a variety of predictors and returns the
result and expected error of the most accurate predictor. To obtain
better performance for PMHD3D we developed a memory sensor that
measures the available free memory of a machine. The sensor has been
extended and is now part of NWS.
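The forecaster's selection step can be sketched as follows. This is an illustrative approximation, not the NWS implementation: the three predictors, the use of mean absolute one-step-ahead error as the accuracy measure, and all function names are assumptions.

```python
from statistics import mean, median


def last_value(history):
    """Predict that the next measurement equals the most recent one."""
    return history[-1]


def running_mean(history):
    """Predict the mean of all measurements seen so far."""
    return mean(history)


def sliding_median(history, window=3):
    """Predict the median of the most recent `window` measurements."""
    return median(history[-window:])


PREDICTORS = {
    "last_value": last_value,
    "running_mean": running_mean,
    "sliding_median": sliding_median,
}


def forecast(series):
    """Return (predictor name, forecast, mean absolute error) for the
    predictor with the lowest one-step-ahead error on `series`."""
    if len(series) < 2:
        raise ValueError("need at least two measurements")
    best = None
    for name, predict in PREDICTORS.items():
        # Replay the series: predict point t from the points before it.
        errors = [abs(predict(series[:t]) - series[t])
                  for t in range(1, len(series))]
        mae = mean(errors)
        if best is None or mae < best[2]:
            best = (name, predict(series), mae)
    return best
```

A scheduler would call `forecast` on the time series returned for each sensor (CPU availability, free memory, bandwidth, latency) and use the returned value as the prediction for that resource.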

2.5.  Interactions  Among  System  Components 

PMHD3D  can  directly  access  Legion’s  scheduling  fa¬ 
cilities  or  can  use  AppLeS  to  obtain  a  more  performance- 
efficient  schedule.  Figure  1  shows  the  interactions  among 
components  in  each  of  these  scenarios.  The  dotted  line  rep¬ 
resents  the  scheduling  of  a  PMHD3D  run  without  AppLeS 
facilities:  the  user  supplies  the  number  of  processors,  the 
processor  list,  and  the  associated  problem  size  per  proces¬ 
sor  and  the  rest  of  the  scheduling  process  is  supplied  by  a 
default  scheduler  within  the  Legion  infrastructure. 

When  the  application  uses  AppLeS  for  scheduling,  the 
interactions  among  components  can  instead  be  represented 
by  the  solid  lines  in  Figure  1.  In  this  case  the  user  sup¬ 
plies  only  the  problem  size  of  interest.  AppLeS  collects 
the  list  of  available  resources  from  the  environment  (via  the 
Legion  collection  object  or,  in  our  case,  via  the  Legion  con¬ 
text  space),  and  then  queries  NWS  to  obtain  updated  per¬ 
formance  and  availability  predictions  for  the  available  re¬ 
sources.  As  the  figure  shows,  AppLeS  collects  the  NWS 
predictions  as  an  IBP  client:  the  predictions  are  pushed  into 
the  IBP  server  by  a  separate  process. 

AppLeS  then  creates  a  performance-promoting  adaptive 
schedule  and  asks  the  Legion  scheduler  to  execute  it.  The 
schedule  is  adaptive  because  AppLeS  assigns  a  different 
amount  of  work  to  each  processor  depending  on  their  pre¬ 
dicted  performance.  As  is  suggested  by  the  figure,  the 
PMHD3D  AppLeS  is  built  on  top  of  Legion  facilities.  A  fu¬ 
ture  goal  is  to  integrate  the  AppLeS  as  an  alternative  sched¬ 
uler  in  Legion  for  the  class  of  regular,  data-parallel,  stencil 
applications. 

3.  The  PMHD3D  AppLeS 

The general AppLeS approach is to create good schedules for an
application by incorporating application-specific characteristics,
system characteristics, and dynamic resource performance data in
scheduling decisions. The PMHD3D AppLeS draws upon the general AppLeS
methodology [3] and the experience gained building an AppLeS for a
structurally similar Jacobi-2D application [2].

Conceptually,  the  PMHD3D  AppLeS  can  be  decom¬ 
posed  into  three  components: 

•  a  performance  model  that  accurately  represents  ap¬ 
plication  performance  within  the  Computational  Grid 
environment; 

•  a  resource  selection  strategy  that  identifies  poten¬ 
tially  performance-efficient  candidate  resource  sets 
from  those  that  are  available  at  run  time; 

•  a  schedule  creation  and  selection  strategy  that  cre¬ 
ates  a  good  schedule  for  each  of  the  various  candidate 
resource  sets  and  then  selects  the  most  performance- 
efficient  schedule. 

The  overall  strategy  and  organization  of  the  scheduler 
will  be  discussed  here  but  the  details  of  each  component  are 
reserved  for  the  following  sections. 

An  accurate  performance  model  (Section  3.1)  is  funda¬ 
mental  for  the  development  of  good  schedules.  The  per¬ 
formance  model  is  used  in  two  important  ways,  the  first  of 
which is to guide the creation of schedules for specific resource
sets. For example, load balancing is a necessary condition for
developing an efficient schedule but is difficult or impossible to
achieve without an estimate of the relative costs
of  computation  on  various  resources.  An  accurate  perfor¬ 
mance  model  is  also  necessary  for  selection  of  the  highest 
performance  schedule  from  a  set  of  candidate  schedules. 

The  resource  selection  strategy  (Section  3.2)  produces 
several  orderings  of  available  resources  based  on  differ¬ 
ent  concepts  of  “desirability”  of  resources  to  PMHD3D. 
Our  definitions  of  desirability  incorporate  Legion  re¬ 
source  discovery  results,  dynamic  resource  availability  from 
NWS,  dynamic  performance  forecasts  from  NWS,  and 
application-specific  performance  data  for  each  resource. 
Once  complete,  the  ordered  lists  of  resources  are  passed  on 
to  the  schedule  creation  and  selection  component  of  the  Ap¬ 
pLeS. 

The  schedule  creation  step  (Section  3.3)  takes  the  pro¬ 
posed  resource  lists  and  creates  a  good  schedule  for  each 
based  on  the  constraints  the  system  and  application  im¬ 
pose.  System  constraints  are  characteristics  such  as  avail¬ 
able  memory  of  the  resources  while  the  application  con¬ 
straints  are  characteristics  such  as  the  amount  of  memory  re¬ 
quired  for  the  application  to  remain  in  main  memory.  Once 
all  schedules  have  been  created  the  performance  model  is 
used  to  select  the  highest  performance  schedule  (the  one  in 
which  the  execution  time  is  expected  to  be  the  lowest). 

The decomposition of the scheduling process into these disjoint steps
provides an overly simplistic view of the interactions between steps.
In reality the scheduling process requires more complicated
interactions. To accurately represent the true interaction of the
scheduling components we present a pseudo-code version of the PMHD3D
AppLeS strategy in Figure 2. The steps shown in Figure 2 will become
clearer in the following sections.

    Rset = getResourceSet()                        \\ Available resources obtained from Legion
    NWS_data = NWS(Rset)                           \\ NWS forecasts of resource performance
    C = getScheduleConstraints()                   \\ Obtain scheduling constraints for simplex
    for (balance = {0, 0.5, 1})                    \\ Select for CPU power, connectivity, both
        S = sort(Rset, balance, maxP)              \\ Returns list of hosts sorted by desirability
        for (n = 2..maxP)                          \\ Searching for correct number of processors
            sched = findSched(n, S, NWS_data, C)   \\ Use simplex to find schedule on S using C
            while (sched is not found)             \\ Simplex was unsolvable with S and C
                "Schedule constraints are too restrictive"
                relaxConstraints(C)                \\ More schedule flexibility, more possible error
                sched = findSched(n, S, NWS_data, C)   \\ Try to find schedule again
            endwhile                               \\ Found a feasible schedule
            if (cost(sched) < cost(best))          \\ If best one so far keep it, else throw away
                best = sched
            endif
        endfor
    endfor
    run(best)                                      \\ Best schedule found, run it

Figure 2. PMHD3D AppLeS pseudo-code.
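The search loop of Figure 2 can also be written as a short runnable sketch. The helpers are passed in as parameters because their real implementations are described only in later sections; their names and signatures are hypothetical. Two details are our assumptions rather than part of Figure 2: constraints are relaxed on a private copy for each candidate, and the number of relaxation attempts is capped by `max_relax`.

```python
def choose_schedule(resources, forecasts, constraints, max_p,
                    sort_hosts, find_sched, relax, cost, max_relax=10):
    """Search over desirability orderings and processor counts and
    return the lowest-cost feasible schedule found, or None."""
    best, best_cost = None, float("inf")
    for balance in (0.0, 0.5, 1.0):        # three notions of desirability
        ordered = sort_hosts(resources, balance, max_p)
        for n in range(2, max_p + 1):      # candidate processor counts
            c = dict(constraints)          # assumption: relax a private copy
            sched = find_sched(n, ordered, forecasts, c)
            attempts = 0
            while sched is None and attempts < max_relax:
                c = relax(c)               # more flexibility, more possible error
                sched = find_sched(n, ordered, forecasts, c)
                attempts += 1
            if sched is not None and cost(sched) < best_cost:
                best, best_cost = sched, cost(sched)
    return best
```

Here `cost` plays the role of the performance model: it predicts the execution time of a candidate schedule so that candidates built from different resource sets can be compared.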

3.1.  Performance  Model 

The  goal  of  the  performance  model  is  to  accurately  pre¬ 
dict  the  execution  time  of  PMHD3D.  Since  the  run-time 
may  vary  somewhat  from  processor  to  processor,  we  take 
the  maximum  run-time  of  any  processor  involved  in  the 
computation  as  the  overall  run-time.  During  every  iteration 
each  processor  computes  on  its  slab  of  data,  communicates 
with  its  neighbors,  and  synchronizes  with  all  other  proces¬ 
sors. 

Formally, the running time for processor i is given by:

$$T_i = Comp_i + Comm_i + Over_i$$

where $Comp_i$, $Comm_i$, and $Over_i$ are the predicted computation
time, the predicted communication time, and the estimated overhead for
$P_i$, respectively.

Computation time is directly related to the units of work assigned to a
processor (in other words, the height of the slab) and to the speed of
that processor. The computation time for $P_i$ is:

$$Comp_i = \frac{X_i \cdot BM_i}{Avail_i}$$

where $X_i$ is the amount of work allocated to processor $P_i$
(dynamically determined by the scheduling process), $BM_i$ is a
benchmark for the application-specific speed of $P_i$'s processor
configuration, and $Avail_i$ is a forecast of the CPU availability on
processor $P_i$ (obtained from dynamic NWS forecasts). To obtain the
benchmarks, we ran PMHD3D on dedicated machines with various problem
sizes and varying numbers of hosts. Execution times were proportional
to problem size and are given in terms of seconds per point on each
platform.
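As a worked illustration of the computation-time term, the sketch below evaluates the slab height times the benchmark, divided by the forecast CPU availability. The function and parameter names are ours, and treating the availability forecast as a fraction in (0, 1] is an assumption.

```python
def comp_time(slab_height, bm_seconds_per_point, cpu_avail):
    """Predicted computation time for one processor.

    slab_height          -- units of work X_i assigned to the processor
    bm_seconds_per_point -- dedicated-machine benchmark BM_i
    cpu_avail            -- forecast fraction of the CPU available, Avail_i
    """
    if not 0.0 < cpu_avail <= 1.0:
        raise ValueError("cpu_avail must be in (0, 1]")
    return slab_height * bm_seconds_per_point / cpu_avail
```

For example, a slab of height 1000 on a machine benchmarked at 0.25 seconds per point takes 250 seconds when dedicated, and is predicted to take 500 seconds if only half of the CPU is forecast to be available.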

Communication  time  is  modeled  as  the  time  required 
for  transferring  data  to  neighboring  processors  across  the 
available  network.  This  represents  communication  for  all 
iterations  and  accounts  for  both  the  time  to  establish  a  con¬ 
nection  and  the  time  to  transfer  the  messages.  To  simplify 
the  communication  model,  we  have  not  attempted  to  di¬ 
rectly  predict  synchronization  time  or  the  time  a  processor 
waits  for  a  communication  partner.  We  hope  instead  to  cap¬ 
ture  the  effect  of  these  communication  costs  in  our  estimate 
of  overhead  costs,  which  we  discuss  shortly.  Communica¬ 
tion  time  is  then: 

$$Comm_i = \frac{MB}{b_{i,i+1} + b_{i,i-1}} + M \cdot (l_{i,i+1} + l_{i,i-1})$$

where MB is the total megabytes transferred, M is the number of
messages transferred, and $b_{i,j}$ and $l_{i,j}$ are predictions of
available bandwidth and latency from $P_i$ to $P_j$, respectively.
Predictions of available bandwidth and latency between pairs of
processors are obtained from dynamic NWS forecasts. To provide an
estimate of the number of messages transferred (M) and the megabytes
transferred (MB) we examined post-execution program performance reports
provided by Legion. For a variety of problem sizes and resource set
sizes the number of megabytes transferred varied by less than 5%, so we
used an average value for all runs. Data transfer does not vary
significantly with problem size because the problem size affects only
the height of the grid while the decomposition is performed
horizontally. Data transfer costs also do not vary with the number of
processors because each processor must communicate with only its
neighbors, regardless of the total number of processors. Although the
number of messages transferred varied more significantly from run to
run, we also used an average value for this variable. This
approximation did not adversely affect our scheduling ability in the
environments we tested; in cases where communication costs are more
severe a model could be developed to approximate the expected number of
messages transferred.
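The communication term can be evaluated in the same way. The sketch below assumes bandwidth in megabytes per second and latency in seconds; the units, the function name, and the parameter names are our assumptions.

```python
def comm_time(mb_total, n_messages, bw_next, bw_prev, lat_next, lat_prev):
    """Predicted communication time for one processor.

    mb_total   -- total megabytes transferred (MB)
    n_messages -- total number of messages transferred (M)
    bw_next, bw_prev   -- forecast bandwidth to neighbors i+1 and i-1
    lat_next, lat_prev -- forecast latency to neighbors i+1 and i-1
    """
    transfer = mb_total / (bw_next + bw_prev)      # time to move the data
    setup = n_messages * (lat_next + lat_prev)     # per-message latency cost
    return transfer + setup
```

Since MB and M are averages taken over earlier runs, only the bandwidth and latency forecasts change from one scheduling decision to the next.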

The  overhead  factor  Overi  is  included  in  the  perfor¬ 
mance  model  to  capture  application  and  system  behav¬ 
ior  that  cannot  be  accounted  for  by  a  simple  commu¬ 
nication/computation  model.  For  example,  a  processor 
will  likely  spend  time  synchronizing  with  other  processors, 
waiting  for  neighbor  processors  for  data  communication, 
and  waiting  for  system  delays.  System  overheads  are  as¬ 
sociated  with  specifics  of  the  hardware  and  Legion  infras¬ 
tructure  such  as  the  time  required  to  resolve  the  physical 
location  of  a  data  object  needed  by  the  application.  The 
overhead  for  PMHD3D  can  be  estimated  by: 

$$Over_i = 16 - 1.5 \cdot \frac{probSize}{1000} + 0.094 P^2$$

where  P  is  the  number  of  processors  involved  in  the  com¬ 
putation  and  probSize  is  the  height  of  the  PMHD3D  mesh. 
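The overhead estimate and the overall run-time prediction (the maximum of the per-processor times, as stated at the start of this section) can be combined in a few lines. The function names are ours, and the per-processor computation and communication times are assumed to have been computed already.

```python
def overhead(prob_size, n_procs):
    """Empirical overhead estimate: 16 - 1.5 * probSize / 1000 + 0.094 * P^2."""
    return 16 - 1.5 * prob_size / 1000 + 0.094 * n_procs ** 2


def predicted_runtime(comp, comm, prob_size):
    """Overall predicted run time: the slowest processor's
    computation + communication + overhead.

    comp, comm -- per-processor lists of predicted times
    """
    over = overhead(prob_size, len(comp))
    return max(c + m + over for c, m in zip(comp, comm))
```

With a problem size of 2000 on 10 processors the overhead estimate is 16 - 3 + 9.4 = 22.4; the quadratic term dominates as the number of processors grows.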

$Over_i$ was estimated empirically using data from 106 individual
application executions with problem sizes varying from 1000 to 6000 and
with resource set sizes varying between 4 and 26. To determine the
effect of the number of processors on overhead, runs were grouped by
problem size and the corresponding execution times plotted against the
number of processors. For each set of runs performed with the same
problem size, a quadratic fit was performed on the difference between
the actual execution time and the predicted execution time (without the
overhead factor). The quadratic factor varied between 0.090 and 0.096
with a mean of 0.094 (standard deviation of 0.0022). To determine the
effect of problem size on overhead we used the same runs but performed
a linear fit of the predicted/actual execution time difference against
problem size.

3.2.  Resource  Selection 

Resource  selection  is  the  process  of  selecting  a  set 
of  target  resources  (processors  in  this  case)  that  will  be 
performance-efficient.  Finding  the  optimal  set  of  resources 
requires  comparing  all  possible  schedules  on  all  possible 
subsets  of  the  resource  pool  -  clearly  an  inefficient  pro¬ 
cess  as  the  resource  pool  becomes  large.  Instead,  we  create 
several  ordered  lists  of  resources  by  employing  a  heuristic 
to  sort  candidate  resources  in  terms  of  several  definitions 
of  resource  desirability .  Resource  desirability  is  based  on 
how  resource  characteristics  such  as  computational  speed 
and  network  connectivity  will  affect  the  performance  of 
PMHD3D. 

The resource selection process begins by querying Legion to discover
the available set of resources. Effective evaluation of the
desirability of each resource requires application-specific performance
information as well as dynamic resource performance information. As of
this writing, Legion collection objects report available resources and
their static configurations but do not provide up-to-date dynamic
information on availability, load, or connectivity. Accordingly, the
list of available resources reported by Legion is used to query NWS for
dynamic forecasts of resource availability, CPU load, and free memory
for each host and of latency and bandwidth between all pairs of hosts.
To obtain the computational cost per unit of the PMHD3D grid on each
type of resource we used the benchmarking method described in Section
3.1.

Once  the  available  resource  lists  and  the  dynamic  sys¬ 
tem  characteristics  are  collected,  the  list  can  be  ordered  in 
terms  of  desirability.  We  use  three  definitions  of  desirability 
of  a  resource:  desirability  based  on  connectivity,  desirabil¬ 
ity  based  on  computational  power,  and  desirability  based 
equally  on  the  two  characteristics.  Connectivity  is  approx¬ 
imated  by  computing  the  latency  and  bandwidth  between 
the  resource  in  question  and  all  other  resources  in  the  re¬ 
source  pool:  as  a  metric  we  calculate  the  amount  of  time 
(seconds)  it  would  take  for  the  resource  in  question  to  ex¬ 
change  a  packet  of  size  1  byte  to  and  from  every  other  host. 
Computational  power  is  measured  by  the  time  (seconds)  it 
would  take  the  host  to  compute  1  point  for  1  iteration  based 
on  the  NWS  predictions  and  the  benchmarks  we  discussed 
earlier.  The  balanced  strategy  orders  the  resources  based  on 
an  average  of  computational  power  and  connectivity. 

The  resource  set  is  sorted  into  3  resource  lists  using  the 
3  notions  of  resource  desirability.  We  then  create  subsets  of 
the  lists  by  selecting  the  n  most  desirable  hosts  from  each 
list  where  n  =  2 ...maxP  and  n  is  even.  We  select  multiple 
subsets  from  each  list  because  it  is  often  impossible  to  know 
the  optimal  number  of  hosts  a  priori.  Once  the  subsets  have 
been  created  the  resulting  group  of  proposed  resource  sets 
are  passed  on  to  the  schedule  creation  step  described  in  the 
next  section.  Although  the  approach  described  here  is  not 
guaranteed  to  find  the  optimal  resource  set,  the  methodol¬ 
ogy  provides  a  scalable  and  performance-efficient  approach 
to  resource  selection. 
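The ordering and subset steps can be sketched as follows. Several details are our assumptions: bandwidth is taken in bytes per second, `balance` is read as the weight placed on connectivity (0 ranks by computational power only, 1 by connectivity only, 0.5 by their average), the two costs are averaged without normalization, and all names are hypothetical.

```python
def connectivity_cost(host, hosts, latency, bandwidth):
    """Seconds for `host` to exchange a 1-byte packet to and from every
    other host. `latency` and `bandwidth` map (src, dst) pairs to
    forecasts in seconds and bytes per second."""
    total = 0.0
    for other in hosts:
        if other == host:
            continue
        for src, dst in ((host, other), (other, host)):
            total += latency[(src, dst)] + 1.0 / bandwidth[(src, dst)]
    return total


def rank_hosts(hosts, comp_cost, conn_cost, balance):
    """Sort hosts from most to least desirable (lowest cost first).
    comp_cost[h] is seconds to compute one point for one iteration."""
    def score(h):
        return (1.0 - balance) * comp_cost[h] + balance * conn_cost[h]
    return sorted(hosts, key=score)


def candidate_sets(hosts, comp_cost, conn_cost, max_p):
    """Take the n most desirable hosts from each ordering, n even."""
    sets = []
    for balance in (0.0, 0.5, 1.0):
        ranked = rank_hosts(hosts, comp_cost, conn_cost, balance)
        for n in range(2, min(max_p, len(ranked)) + 1, 2):
            sets.append(ranked[:n])
    return sets
```

The number of candidate sets grows linearly with `max_p` rather than exponentially with the size of the resource pool, which is what makes the heuristic scalable.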

3.3.  Schedule  Creation  and  Selection 

For each of the proposed resource sets, a schedule is developed.
Essentially, schedule development on a given resource set for PMHD3D
reduces to finding a work allocation that provides good time balancing.
As in Section 3.1, work allocation is represented by $X_i$, the height
of the slab given to processor $P_i$.

One of the most important characteristics for any solution to this
problem is time balancing: all processors should finish at the same
time. Using the notation from Section 3.1, $T_i = T_{i+1}$ for
$i \in \{1, \ldots, n-1\}$ and, since all of the work must be
allocated, we also have $\sum_i X_i = probSize$. Taken together we have
n equations in n unknowns and the problem can be solved with a basic
linear solver. This approach was successful for the Jacobi-2D AppLeS
[2] but is not powerful enough to incorporate several additional
constraints required to develop good schedules for PMHD3D.
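Because the communication and overhead terms of Section 3.1 do not depend on the slab height, the time-balancing system has a simple closed-form solution. The sketch below is our own illustration of that unconstrained case, not the authors' solver; the names are hypothetical. Writing $c_i = BM_i / Avail_i$ and $k_i = Comm_i + Over_i$, each equation reads $X_i c_i + k_i = T$, so $X_i = (T - k_i)/c_i$, and the common finish time $T$ follows from requiring the heights to sum to the problem size.

```python
def balance_slabs(prob_size, sec_per_unit, fixed_cost):
    """Solve X_i * sec_per_unit[i] + fixed_cost[i] = T for all i,
    subject to sum(X_i) = prob_size.

    sec_per_unit[i] -- BM_i / Avail_i
    fixed_cost[i]   -- Comm_i + Over_i
    Returns (heights, T). Heights are real-valued and may be negative
    if one processor's fixed cost alone exceeds T; no memory limit is
    enforced here.
    """
    inv = [1.0 / c for c in sec_per_unit]
    t = (prob_size + sum(k * w for k, w in zip(fixed_cost, inv))) / sum(inv)
    heights = [(t - k) * w for k, w in zip(fixed_cost, inv)]
    return heights, t
```

For two processors where the second is half as fast as the first and fixed costs are zero, a problem of size 300 is split 200/100 and both finish at time 200. Nothing in this solution prevents a slab from exceeding a machine's memory, which motivates the additional constraints that follow.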

One  of  the  important  constraints  for  PMHD3D  perfor¬ 
mance  is  the  amount  of  memory  available  for  the  applica¬ 
tion.  There  is  a  limit  to  the  size  of  problem  that  can  be 
placed  on  a  machine  because  if  the  computation  spills  out 
of  memory,  performance  can  drop  by  two  orders  of  magni¬ 
tude.  To  quantify  this  constraint  a  benchmark  for  applica¬ 
tion  memory  usage  must  be  obtained  by  observing  memory 
usage  for  varying  problem  sizes  on  each  type  of  resource. 
Formally,  this  constraint  becomes: 

$$BMmem_i \cdot X_i < MemAvail_i$$

where $MemAvail_i$ is the available memory for processor i (provided by
the NWS memory sensor) and $BMmem_i$ is the memory benchmark
(megabytes/unit) recorded for processor i's architecture.

We  formalize  the  work  allocation  constraints  as  a  Lin¬ 
ear  Programming  problem  (from  now  on  simply  LP),  solv¬ 
able  with  the  simplex  method  [6].  In  short,  LP  solves  the 
problem of finding an extreme (maximum or minimum) of
a function $f(x_1, x_2, \ldots, x_n)$ where the unknowns have to
satisfy a set of constraints $g(x_1, x_2, \ldots, x_n) \geq b$ and both
the objective function and the constraints are linear. The
simplex  is  a  well-known  method  used  to  solve  LP  prob¬ 
lems.  The  simplex  formulation  requires  that  constraints  are 
expressed in standard form; that is, the constraints must be
expressed  as  equalities  and  each  variable  is  assigned  a  non¬ 
negativity  sign  restriction.  There  is  a  simple  procedure  that 
can  be  used  to  transform  LP  problems  into  a  standard  form 
equivalent. 

We  modified  the  time  balancing  equations  to  provide 
some  flexibility  for  the  constraints  specification:  expected 
execution  time  for  any  processor  in  the  computation  must 
fall  within  a  small  percentage  of  the  expected  total  running 
time.  This  flexibility  is  beneficial,  especially  as  additional 
constraints  such  as  memory  limits  are  incorporated  into  the 
problem  formulation.  The  constraints  are  initially  very  rigid 
but  can  be  relaxed  in  cases  where  no  solution  can  be  found 
given  the  initial  constraints.  The  time  balancing  equations 
and  the  application  memory  requirements  form  the  applica¬ 
tion  constraints  on  which  the  simplex  has  to  operate.  The 
simplex  formulation  also  requires  specification  of  an  objec¬ 
tive  function  where  the  goal  of  the  solver  is  to  maximize  the 
objective  function  while  satisfying  the  simplex  constraints. 
We use $\sum_i x_i$ as the objective function and search for a
solution where all work is allocated.
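
One plausible way to write the resulting problem is sketched below. The auxiliary variable $T$ (expected total running time) and the tolerance $\epsilon$ are names introduced here for illustration; the text only states that each processor's expected time must fall within a small percentage of the total. The sketch also assumes that $T_i$ is linear in $x_i$ and relaxes the strict memory inequality to $\le$, as a linear program requires.

```latex
\begin{aligned}
\text{maximize}   \quad & \sum_{i=1}^{n} x_i \\
\text{subject to} \quad & (1-\epsilon)\,T \le T_i \le T, && i = 1,\ldots,n \\
                        & BMmem_i \cdot x_i \le MemAvail_i, && i = 1,\ldots,n \\
                        & \sum_{i=1}^{n} x_i \le probSize \\
                        & x_i \ge 0, \quad T \ge 0
\end{aligned}
```

A solution whose objective value equals $probSize$ allocates all of the work; a smaller optimum, or infeasibility, indicates that the constraints are too tight for the resource set.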

For each of the proposed resource sets the simplex is
used to create the best schedule possible for that resource
set.  We  use  a  library  [16]  which  provides  a  fast  and  easy  to 
use  implementation  of  the  simplex.  There  are  several  bene¬ 
fits  of  using  linear  programming  and  the  simplex  method  to 
create  a  good  schedule: 

•  Linear  programming  is  well  known  and  commonly 
used  so  that  fast  and  reliable  algorithms  are  readily 
available. 

•  Once  the  constraints  are  formalized  as  a  linear  pro¬ 
gramming  problem,  adding  additional  constraints  is 
trivial.  For  example,  the  FORTRAN  compiler  used  to 
compile  PMHD3D  enforced  a  limit  on  the  maximum 
size  of  arrays,  therefore  limiting  the  maximum  units 
of  work  that  could  be  allocated  to  any  processor.  This 
constraint  was  easily  added  to  the  problem  formaliza¬ 
tion. 

•  The  linear  programming  problem  can  be  extended  to 
give integer solutions, although the problem then becomes
much more difficult. Currently the solver computes
real values for work allocation and we redistribute
the fractional work portions. In some problems
an integer solution may be required for additional accuracy.

•  In  the  case  that  a  solution  cannot  be  found,  the  simplex 
method  provides  important  feedback.  For  this  applica¬ 
tion,  the  simplex  could  not  find  a  solution  if  the  con¬ 
straints  were  too  restrictive.  In  this  case  the  simplex  is 
reiterated  with  successively  relaxed  constraints  until  a 
solution  can  be  reached. 
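
The retry-with-relaxation behaviour in the last item can be sketched as a small loop. Everything named here is hypothetical: `solve` stands in for a call to the simplex library and is assumed to return `None` when no feasible allocation exists, and the starting tolerance, growth factor, and upper bound are illustrative values, not the ones used by the PMHD3D AppLeS.

```python
def solve_with_relaxation(solve, tolerance=0.01, factor=2.0, max_tolerance=0.5):
    """Call solve(tolerance) with a growing tolerance until it succeeds.

    solve is a hypothetical callback returning an allocation, or None
    when the constraints are infeasible at the given tolerance.
    """
    while tolerance <= max_tolerance:
        allocation = solve(tolerance)
        if allocation is not None:
            return allocation, tolerance
        tolerance *= factor  # relax the time-balancing band and retry
    return None, None


# Toy stand-in for the LP solver: feasible only once tolerance reaches 5%.
def toy_solver(tolerance):
    return [500, 500] if tolerance >= 0.05 else None


result, used = solve_with_relaxation(toy_solver)
```

Starting from rigid constraints and loosening them only on failure keeps the schedule as well balanced as the resource set allows.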

Once  the  proposed  schedules  are  identified,  schedule  se¬ 
lection  is  surprisingly  simple.  The  performance  model  is 
used  to  evaluate  the  expected  execution  time  of  each  pro¬ 
posed  schedule,  and  the  schedule  with  the  lowest  estimated 
execution  time  is  selected  and  implemented. 
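
Schedule selection as described above amounts to taking a minimum over the candidates. In this minimal sketch the candidate records and their predicted times are made-up values, and `predict_time` is a hypothetical stand-in for the performance model.

```python
def select_schedule(schedules, predict_time):
    """Return the candidate with the lowest predicted execution time."""
    return min(schedules, key=predict_time)


# Hypothetical candidates: one schedule per proposed resource set.
candidates = [
    {"hosts": 16, "predicted": 41.0},
    {"hosts": 18, "predicted": 39.5},
    {"hosts": 20, "predicted": 40.2},
]
best = select_schedule(candidates, lambda s: s["predicted"])
```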

4.  Results 

The  PMHD3D  AppLeS  has  been  implemented  and  we 
present  results  to  investigate  the  usefulness  of  the  method¬ 
ology.  The  goals  of  these  experiments  were  to: 

•  Evaluate  the  accuracy  of  our  performance  prediction 
model. 

•  Evaluate  the  ability  of  the  PMHD3D  AppLeS  to  pro¬ 
mote  application  performance  in  a  multi-user  Legion 
environment. 

The  previous  sections  stressed  the  importance  of  the  per¬ 
formance  model  for  effective  scheduling.  In  Section  4.2  we 
explain  in  detail  results  demonstrating  the  accuracy  of  the 
performance model. In Section 4.3 we present evidence that
the  scheduling  methodology  and  implementation  are  effec¬ 
tive  in  practice.  Before  discussing  these  results  we  first  out¬ 
line  our  experimental  design. 

4.1.  Experimental  Design 

To  evaluate  the  PMHD3D  AppLeS,  we  conducted  ex¬ 
periments  on  the  University  of  Virginia  Centurion  Cluster, 
a  large  cluster  of  machines  maintained  by  the  Legion  team 
(see  [4]  for  more  information  on  the  cluster).  The  Centu¬ 
rion  Cluster  is  continuously  upgraded  for  new  Legion  ver¬ 
sion  releases;  during  the  3-month  period  of  the  experiments, 
we  used  Legion  versions  1.5  through  1.6.1.  The  cluster  it¬ 
self  is  composed  of  128  Alphas  and  128  Dual-Pentium  II 
PCs;  12  fast  Ethernet  switches  and  a  gigaswitch  connect 
the  whole  cluster.  Although  we  employed  both  Alphas  and 
Pentiums  during  the  development  and  initial  testing  process, 
we  had  multiple  difficulties  with  Alpha  Linux  kernel  insta¬ 
bilities  and  a  faulty  network  driver  which  made  our  data 
for the Alpha machines unreliable. The results presented
here are based only on the 400 MHz Dual Pentium II machines.
We did not employ the second processor on the Dual
Pentium; therefore, when we talk about a host or machine we
consider  the  machines  to  be  uniprocessors.  It  is  worth  not¬ 
ing  that  many  users  only  use  one  processor  per  node  so  that 
even  a  computationally  intensive  user  will  not  affect  CPU 
availability  as  much  as  might  be  expected.  However,  the 
two  processors  on  each  Dual  Pentium  machine  utilize  the 
same  memory,  sometimes  leading  to  performance  degrada¬ 
tion  due  to  overloaded  memory  systems.  Inclusion  of  mem¬ 
ory  constraints  in  the  performance  model  helped  the  Ap¬ 
pLeS  scheduler  avoid  overloaded  memory  systems. 

We  restricted  our  experiments  to  34  machines  for  practi¬ 
cal  reasons:  the  dynamic  information  collected  from  NWS 
includes  a  large  amount  of  data,  even  for  a  relatively  small 
cluster.  Limiting  the  resource  pool  did  not  impact  inves¬ 
tigations  of  application  performance  or  schedule  efficiency 
because,  as  will  become  clear,  the  parallelism  available  in 
PMHD3D  for  the  problem  sizes  studied  here  is  well  below 
the  34  machine  limit.  As  explained  in  Section  2.5  we  used 
an  IBP  server  running  at  all  times  at  UCSD,  while  AppLeS 
acted  as  an  IBP  client  retrieving  the  forecasts.  This  setup 
allowed  us  to  obtain  updated  predictions  for  a  large  number 
of  resources  in  a  reasonable  amount  of  time.  On  average  it 
took  less  than  4  seconds  to  retrieve  the  data,  with  a  mini¬ 
mum  of  2.5  seconds  and  a  maximum  of  8.5  seconds. 

To  test  the  performance  of  PMHD3D  under  a  variety 
of  conditions,  experiments  were  typically  performed  with 
maximum  resource  set  sizes  (from  now  on  called  resource 
pool or simply pool) of 4, 6, ..., 26 and problem sizes of
1000, 2000, ..., 6000. Problem size is the height of the data
grid used by PMHD3D. The pool is the maximum number
of machines the scheduler is allowed to employ. We
test  varying  pool  sizes  to  simulate  conditions  under  which 
a  user  may  be  limited  to  a  certain  number  of  resources  by 
cost  or  access  considerations.  Although  our  overall  resource 
pool  contains  34  machines  in  total,  the  maximum  pool  size 
we  simulate  is  only  26.  This  choice  was  practical:  we  fre¬ 
quently  found  unavailable  or  inaccessible  machines  in  our 
overall  resource  pool  and  so  were  never  able  to  access  all 
34  machines  at  one  time.  Note  also  that  the  scheduler  may 
determine that utilizing the entire pool is not the most
performance-efficient choice. In this case the pool is larger than
the  number  of  target  resources. 

The  experiments  presented  in  Section  4.2  were  con¬ 
ducted  under  unloaded  conditions  while  those  presented  in 
Section  4.3  were  conducted  under  loaded  conditions.  The 
ambient  load  present  during  most  of  our  loaded  runs  con¬ 
sisted  of  heavy  use  of  some  machines  and  light  use  of  oth¬ 
ers.  In  order  to  investigate  application  performance  we 
report  performance  results  based  on  application  execution 
time.  However,  there  is  a  cost  associated  with  using  Ap¬ 
pLeS  to  develop  a  schedule.  We  analyzed  43  runs  in  detail 
and  the  dominant  scheduling  cost  is  associated  with  query¬ 
ing  the  Legion  Collection  and  the  Legion  context  space. 
The  time  required  to  access  NWS  and  IBP  is  on  average  less 
than  4  seconds.  Once  the  system  and  performance  informa¬ 
tion  has  been  collected,  the  AppLeS  required  on  average 
roughly  1  second  to  order  the  resources,  create  schedules, 
and  select  the  best  schedule. 

4.2.  Performance  Model  Validation 

The  performance  model  is  the  basis  for  determining  a 
good  work  allocation  and,  more  importantly,  provides  the 
basis  for  selecting  a  final  schedule  among  those  that  have 
been  considered.  We  tested  model  accuracy  for  a  variety  of 
problem  sizes  and  target  resource  sets  (see  Figure  3).  For 
the  62  runs  shown  in  this  figure  the  model  accurately  pre¬ 
dicts  execution  time  within  7.5%,  on  average.  The  perfor¬ 
mance  model  consistently  achieved  this  level  of  accuracy 
for  other  runs  taken  under  similar  conditions.  Notice  that  as 
the  problem  size  becomes  larger,  the  smallest  pool  that  we 
test also increases (e.g. the smallest pool for a problem size
of  2000  is  of  size  4  while  for  a  problem  size  of  6000  it  is 
12).  This  experimental  setup  was  required  by  a  limit  in  the 
g77  FORTRAN  compiler  we  employed:  no  more  than  507 
work  units  could  be  allocated  to  any  one  processor  during 
the  computation. 
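
The effect of the 507-unit limit on the smallest usable pool can be checked with a short calculation. This is a sketch: the function names are hypothetical, and the list of tested pool sizes (4, 6, ..., 26) is taken from Section 4.1.

```python
import math

MAX_UNITS_PER_PROCESSOR = 507  # g77 limit reported in the text


def min_processors(problem_size, limit=MAX_UNITS_PER_PROCESSOR):
    """Fewest processors that can hold problem_size units under the limit."""
    return math.ceil(problem_size / limit)


def smallest_tested_pool(problem_size, pool_sizes=range(4, 27, 2)):
    """Smallest pool among the tested sizes that satisfies the limit."""
    needed = min_processors(problem_size)
    return next(p for p in pool_sizes if p >= needed)
```

For problem sizes of 2000 and 6000 this gives pools of 4 and 12, matching the values quoted above.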

Figure  3  demonstrates  the  importance  of  selecting  an  ap¬ 
propriate  number  of  target  resources  for  PMHD3D.  For  ex¬ 
ample,  for  a  problem  size  of  1000  the  minimal  execution 
time  is  achieved  when  the  application  is  run  on  10  proces¬ 
sors.  If  fewer  processors  are  used,  the  amount  of  work  per 
processor  is  high  and  the  overall  execution  time  is  higher. 
Figure 3. Model predictions (dashed lines) and observed execution time (solid lines) for a variety of
problem sizes and pool sizes.


Table 1. Number of resources to target for
various problem sizes under unloaded conditions.
Optimal is the best choice, range indicates
close to optimal choices.

Size    Hosts (optimal)    Range
1000    10                 8-12
2000    12                 12-14
3000    14                 14-16
4000    16                 14-18
5000    18                 16-18
6000    18                 18-20

If  more  processors  are  used,  the  added  communication  and 
system  overheads  cannot  be  offset  by  the  advantage  of  the 
additional  computational  power.  Significantly,  the  perfor¬ 
mance  model  accurately  tracks  the  knee  (i.e.  inflection 
point)  in  the  curve  and  is  thus  capable  of  predicting  the  cor¬ 
rect  number  of  target  resources,  at  least  under  these  con¬ 
ditions.  We  report  the  optimal  number  of  target  resources 
for all problem sizes tested in Table 1. As will be obvious
in  Section  4.3,  the  optimal  number  of  processors  may  vary 
with  resource  performance  and  dynamic  system  conditions 
as  well  as  with  problem  size. 

Figure  4  demonstrates  the  scheduling  advantage  of  accu¬ 
rately  predicting  the  correct  number  of  processors  to  target. 
In  these  experiments  the  PMHD3D  AppLeS  was  allowed  to 
select any number of processors up to the maximum pool
size. The PMHD3D AppLeS selects the maximum number
of resources for each resource pool up to and including
a  size  of  18.  For  resource  pools  of  size  20  and  larger  the 
optimal  number  of  hosts  is  18  and  the  PMHD3D  AppLeS 
correctly  selects  only  18  hosts. 
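
The observed behaviour can be summarised as taking the smaller of the pool size and the optimal host count. The sketch below only reproduces that outcome from the Table 1 values for unloaded conditions; the AppLeS itself arrives at the choice through its performance model, not through a lookup, and the function name is hypothetical.

```python
# Optimal host counts from Table 1 (unloaded conditions).
OPTIMAL_HOSTS = {1000: 10, 2000: 12, 3000: 14, 4000: 16, 5000: 18, 6000: 18}


def target_hosts(problem_size, pool_size):
    """Hosts used when the pool caps an otherwise optimal choice."""
    return min(pool_size, OPTIMAL_HOSTS[problem_size])


chosen = [target_hosts(5000, pool) for pool in (16, 18, 20, 26)]
```

For a problem size of 5000, `chosen` shows the full pool being used up to 18 hosts and 18 hosts thereafter.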


Figure  4.  PMHD3D  AppLeS  predicted  and  ac¬ 
tual  execution  times  for  a  problem  size  of 
5000. 


4.3.  Performance  Results 

Once  we  verified  that  the  performance  model  is  accu¬ 
rate  in  a  predictable  environment  (i.e.  where  resources 
are  dedicated),  we  turned  our  attention  to  considering  the 
performance of the AppLeS in a more dynamic, unpredictable,
multi-user environment. We begin by investigating
the ability of PMHD3D AppLeS to compare available
resources  and  select  desirable  hosts  (computationally  fast, 
well-connected,  or  both).  To  provide  a  comparison  point  we 
test  the  performance  of  another  available  scheduler,  namely 
the  default  Legion  scheduler.  We  conducted  experiments  in 
runs, namely back-to-back PMHD3D executions using the
same  resource  pool  and  the  same  problem  size  but  utiliz¬ 
ing  the  PMHD3D  AppLeS  scheduler  first  and  the  default 
Legion  scheduler  second. 


Figure 5. PMHD3D performance attained with
and without the AppLeS scheduler for a problem
size of 1000 (horizontal axis: maximum allowed processors).


In Figure 5 we show a series of runs comparing the
two  schedulers  for  a  problem  size  of  1000.  Clearly,  the 
PMHD3D  AppLeS  provides  a  performance  advantage  for 
all  resource  set  sizes  tested.  However,  it  is  notable  that  the 
two  execution  time  curves  follow  the  same  trend  only  when 
the  resource  pool  is  in  the  range  of  4-12  hosts.  When  more 
resources  are  added  to  the  pool  the  execution  time  achieved 
with  the  PMHD3D  AppLeS  remains  constant  while  the  de¬ 
fault  Legion  scheduler  execution  time  diverges.  The  default 
Legion  scheduler  allocates  all  available  resources,  a  less 
than  optimal  strategy  for  PMHD3D.  In  Table  2  we  report 
the  typical  number  of  processors  selected  by  AppLeS  for 
different  problem  sizes  and  resource  set  sizes. 

For pool sizes of 4-12, the execution time achieved via the
PMHD3D AppLeS is consistently 20-25 seconds lower
than that achieved via the default scheduler. In this range
of  pool  sizes,  the  PMHD3D  AppLeS  selects  the  maximum 
number  of  hosts  available  and  so  uses  the  same  number  of 
resources  as  the  default  Legion  scheduler.  The  performance 
advantage  is  achieved  by  selecting  “desirable”  resources, 
i.e.  resources  that  are  computationally  fast  and/or  well- 
connected.  Figure  6  illustrates  the  load  of  all  available  ma¬ 
chines  just  before  scheduling  occurred  for  the  18-processor 
run  shown  in  Figure  5.  Clearly,  the  PMHD3D  AppLeS  se¬ 
lects  lightly  loaded  hosts  (i.e.  those  hosts  with  high  avail¬ 
ability)  while  the  default  scheduler  selects  several  loaded 
hosts. It is the load on these selected machines that causes
a performance disadvantage for the default scheduler. In a
more heterogeneous network environment the connectivity
of the hosts would also play an important role in host
selection and resulting performance.

Table 2. Hosts chosen by PMHD3D AppLeS.
The Legion default scheduler always selects
the maximum number of hosts.

We  obtained  83  runs  comparing  the  default  Legion 
scheduler  to  the  PMHD3D  AppLeS  for  a  variety  of  problem 
sizes  (1000-6000)  and  pool  sizes  (4-26).  Figure  7  shows  a 
histogram  of  the  percent  improvement  the  PMHD3D  Ap¬ 
pLeS  achieved  over  the  default  Legion  scheduler  for  the  83 
runs  (the  average  improvement  was  30%). 
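
The text does not spell out how percent improvement is computed. A common definition, assumed here, is the reduction in execution time relative to the default scheduler's time; the function name and the sample times are illustrative only.

```python
def percent_improvement(t_default, t_apples):
    """Relative reduction in execution time, in percent (assumed definition)."""
    return 100.0 * (t_default - t_apples) / t_default


# Illustrative times in seconds, not measurements from the paper.
example = percent_improvement(100.0, 70.0)
```

Under this definition a value of zero means no advantage and a negative value means the AppLeS schedule was slower.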

Note  that  in  a  few  runs  there  was  little  or  no  advantage 
to  using  the  PMHD3D  AppLeS.  In  these  cases  the  proces¬ 
sors  were  essentially  idle  and  the  pool  size  was  below  the 
optimal  number  so  that  the  schedulers  selected  the  same 
number  of  processors.  In  one  run  the  PMHD3D  AppLeS- 
determined  schedule  was  considerably  slower  than  that  de¬ 
termined  by  the  default  Legion  scheduler.  In  this  case  the 
scheduler created a schedule based on incorrect system
information: NWS forecasts of CPU availability were unable
to predict a sudden change in load on several machines
and the resulting schedule was poorly load balanced.

The  Legion  default  scheduler  was  designed  to  provide 
general  scheduling  services,  not  the  specialized  services  we 
include  in  the  PMHD3D  AppLeS.  It  is  therefore  not  sur¬ 
prising  that  the  AppLeS  is  better  able  to  promote  applica¬ 
tion  performance.  In  fact,  the  PMHD3D  AppLeS  could  be 
developed as a Legion object for scheduling regular,
iterative, data-parallel computations, and this is a focus of
future work. Using the PMHD3D AppLeS and the Legion
default scheduling strategy as extremes, we wanted to explore
a third alternative for scheduling: what a “smart
user” might do. In a typical user scenario for a cluster of
machines, a user will have access to a large number of
machines and will typically do a back-of-the-envelope static
calculation to determine an appropriate number of target
resources given the granularity of the application. Although a
user may correctly determine the number of hosts to target,
accurate information on resource load and availability will
be difficult or impossible to obtain and interpret prior to or
at compile-time.

Figure 6. A snapshot of CPU availability taken
during scheduling for the 18-processor run
shown in Figure 5.

Figure 7. Range of performance improvement
obtained by PMHD3D AppLeS.

Figure 8. Performance obtained by three
schedulers when each was given access to
at most 26 processors.

To  simulate  this  user  scenario,  we  developed  a  third 
scheduling  method  called  the  smart  user.  The  smart  user 
selects  an  appropriate  number  of  hosts  but  does  not  select 
hosts  based  on  desirability.  Experiments  were  performed 
for  problem  sizes  ranging  from  1000  to  6000  with  a  pool 
size  of  26  hosts.  Figure  8  shows  the  performance  obtained 
by  the  PMHD3D  AppLeS,  the  default  Legion  scheduler, 
and  that  obtained  by  the  smart  user.  In  these  experiments, 
the  PMHD3D  AppLeS  provides  a  significant  performance 
advantage  over  both  alternatives. 

5.  Related  Work 

The  PMHD3D  AppLeS  is  an  adaptation  and  extension  of 
previous  work  targeting  the  structurally  similar  Jacobi-2D 
application ([2], [3]). Jacobi-2D is a data-parallel,
stencil-based iterative code, as is PMHD3D. Both applications
allow non-uniform work distribution; however, Jacobi-2D
employs strip decomposition (using strip widths) for its
2-dimensional grid while PMHD3D employs slab decomposition
(using slab height) for its 3-dimensional grid. While
the  applications  are  structurally  similar,  PMHD3D  required 
tighter  constraints  on  memory  availability  and  a  more  com¬ 
plex  performance  model.  Additionally,  PMHD3D  was  tar¬ 
geted  for  a  much  larger  resource  set  (34  machines  vs.  8). 
The  availability  of  a  larger  resource  pool  for  this  work  mo¬ 
tivated  the  introduction  of  the  quadratic  overhead  term  in 
the  PMHD3D  performance  model.  Previous  AppLeS  work 
has  not  included  the  additional  overhead  of  using  extra  ma¬ 
chines  in  scheduling  decisions. 

As  part  of  our  previous  work,  we  developed  an  AppLeS 
for  Complib  and  the  Mentat  distributed  programming  en¬ 
vironment.  Complib  implements  a  genetic  sequencing  al¬ 
gorithm  for  libraries  of  sequences.  It  is  particularly  diffi¬ 
cult  to  schedule  because  of  its  highly  data  dependent  exe¬ 
cution  profile.  The  implementation  of  Complib  we  chose 
was  for  Mentat  [8]  which  is  an  early  prototype  of  the  Le¬ 
gion  Grid  software  infrastructure.  By  combining  a  fixed 
initial  distribution  strategy  (based  on  a  combination  of  ap¬ 
plication  characteristics  and  NWS  forecasts)  with  a  shared 
work-queue  distribution  strategy,  the  Complib  AppLeS  was 
able to achieve large performance improvements in
different Grid settings [20]. In addition to AppLeS for
Legion/Mentat applications, we have developed AppLeS for a
variety of Grid infrastructures and applications [19, 21, 7].

In  [10],  the  authors  describe  a  scheduler  targeting  data 
parallel  “stencil”  applications  that  use  the  Mentat  program¬ 
ming  system.  They  specifically  examine  Gaussian  elimi¬ 
nation  using  a  master/slave  work-distribution  methodology. 
While  it  is  difficult  to  compare  the  performance  of  each  sys¬ 
tem,  their  approach  differs  from  AppLeS  in  that  it  requires 
more  extensive  modification  of  the  application  and  it  does 
not  incorporate  dynamic  information. 
6. New Directions

An  ultimate  goal  is  to  offer  the  PMHD3D  AppLeS  agent 
within  the  Legion  framework  as  a  default  scheduler  for  it¬ 
erative,  regular,  stencil-based  distributed  applications.  In 
particular,  the  scheduler’s  performance  model  is  flexible 
enough  to  incorporate  the  requirements  and  constraints  of 
other  stencil  applications  and  the  characteristics  of  other 
platforms.  To  use  this  model  for  other  appropriate  appli¬ 
cations,  good  predictions  of  megabytes  transferred,  number 
of  messages  initiated,  overhead  factor,  benchmarks  for  pro¬ 
gram  CPU  and  memory  utilization  over  the  different  target 
architectures,  as  well  as  access  to  dynamic  system  infor¬ 
mation  from  NWS  or  a  similar  system  would  be  required. 
Once  obtained,  these  characteristics  are  used  as  inputs  to 
the  model  without  changing  the  model  structure. 

Portability  and  heterogeneity  are  also  important.  The 
AppLeS  itself  is  written  in  C  and  Perl  and  has  been  com¬ 
piled  successfully  and  executed  on  various  architectures  and 
systems  (Pentium,  Alpha,  Linux  and  Solaris).  Initial  results 
indicate  that  the  scheduler  can  be  used  effectively  on  dif¬ 
ferent  target  environments  without  changes  to  the  structure 
of  the  performance  model.  For  example,  we  used  mpich  on 
a  local  cluster  for  initial  development  and  debugging.  The 
schedule  worked  well  with  only  the  previously  described 
changes  in  model  input  parameters. 
Abstract

Recently,  we  presented  two  very  low-cost  approaches  to 
compile-time  list  scheduling  where  the  tasks’  priorities 
are  computed  statically  or  dynamically,  respectively.  For 
homogeneous  systems,  these  two  algorithms,  called  FCP 
and FLB, have been shown to yield a performance equivalent to
other  much  more  costly  algorithms  such  as  MCP  and  ETF. 
In  this  paper  we  present  modified  versions  of  FCP  and 
FLB  targeted  to  heterogeneous  systems.  We  show  that 
the  modified  versions  yield  a  good  overall  performance, 
which  is  generally  comparable  to  algorithms  specifically 
designed  for  heterogeneous  systems,  such  as  HEFT  or 
ERT.  There  are  a  few  cases,  mainly  for  irregular  problems 
and  large  processor  speed  variance,  where  FCP  and  FLB’s 
performance  drops  down  to  32%  and  63%,  respectively. 
Considering  the  good  overall  performance  and  their  very 
low  cost  however,  FCP  and  FLB  are  interesting  options 
for  scheduling  very  large  problems  on  heterogeneous  sys¬ 
tems. 

Keywords:  compile-time  task  scheduling,  list  schedul¬ 
ing,  low-cost,  heterogeneous  systems 

1  Introduction 

Heterogeneous  systems  have  recently  become  widely 
used  as  a  cheap  way  of  obtaining  a  parallel  system.  Clus¬ 
ters  of  workstations  connected  by  high-speed  networks, 
or simply the Internet are common examples of heterogeneous
systems. However, in order to obtain high
performance from such a system, both compile-time and
runtime  support  is  necessary,  in  which  scheduling  the  ap¬ 
plication  to  the  parallel  system  is  a  crucial  factor.  The 
problem,  known  as  task  scheduling,  has  been  shown  to  be 
NP-complete  [3]. 

The  general  problem  of  task  scheduling  has  been  exten¬ 
sively  studied,  mainly  for  homogeneous  systems.  Various 
heuristics  have  been  proposed,  including  list  algorithms 
[4,  11,  12,  13,  20],  multi-step  algorithms  [14,  15,  22], 
duplication  based  algorithms  [7,  2,  1],  genetic  algo¬ 
rithms  [18],  algorithms  using  local  search  [21],  bin  pack¬ 
ing  [19],  or  graph  decomposition  [6].  Within  all  these  ap¬ 
proaches,  list  scheduling  has  been  shown  to  have  a  good 
cost-performance  trade-off,  as  considering  its  low  cost, 
the performance is still very good [8, 13, 12]. Low
cost is a key issue for large problems, in which even an
$O(V^2)$ algorithm, where $V$ is the number of tasks, may
have a prohibitive cost.

Task  scheduling  has  also  been  studied  in  the  specific 
context  of  heterogeneous  systems  ([5,  9,  10,  16,  17]).  It 
has  been  shown  that  minimizing  the  tasks’  completion 
time  throughout  the  schedule  is  preferable  to  minimizing 
the tasks’ start time [10, 17]. With respect to list scheduling
algorithms, one can note that most of them can be easily
modified to meet the task’s completion time minimization
criterion, and thus obtain good performance also in
the  heterogeneous  case  (e.g.,  HEFT  [17]  and  ERT  [9]  are 
the  versions  using  the  tasks’  completion  time  as  the  task 
priority  of  MCP  [20]  and  ETF  [4],  respectively).  How¬ 
ever,  two  very  low-cost  list  scheduling  algorithms  that  we 
proposed recently, namely FCP (Fast Critical Path) [13]
and  FLB  (Fast  Load  Balancing)  [12],  cannot  be  modi¬ 
fied  in  such  an  easy  way  without  sacrificing  their  com¬ 
petitively  low  cost. 

In  this  paper  we  present  the  modifications  required  to 
obtain  a  good  performance  from  FCP  and  FLB  in  het¬ 
erogeneous  systems.  We  show  that  the  modified  ver¬ 
sions  of  FCP  and  FLB  yield  a  good  overall  performance, 
which  is  generally  comparable  to  algorithms  specifically 
designed  for  heterogeneous  systems,  such  as  HEFT  (Het¬ 
erogeneous  Earliest-Finish-Time)  [17]  and  ERT  (Earliest 
Task  First)  [9].  There  are  a  few  cases,  mainly  for  irregular 
problems  and  wide  processor  speed  ranges,  in  which  FCP 
and  FLB’s  performance  drops  down  to  32%  and  63%,  re¬ 
spectively.  Considering  their  very  low  cost  and  reason¬ 
ably  good  performance,  we  believe  that  FCP  and  FLB  are 
interesting  options  for  task  scheduling  in  heterogeneous 
systems,  especially  for  large  problems  where  scheduling 
time  would  otherwise  be  prohibitive. 

This  paper  is  organized  as  follows:  The  next  two  sec¬ 
tions  briefly  describe  the  scheduling  problem,  and  the 
FCP  and  FLB  algorithms,  respectively.  In  Section  4  we 
study  their  performance  for  heterogeneous  systems.  Sec¬ 
tion  5  concludes  the  paper. 

2  Preliminaries 

The task scheduling algorithm input is a directed acyclic
graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ that models a parallel program, where
$\mathcal{V}$ is a set of $V$ nodes and $\mathcal{E}$ is a set of $E$ edges. A node
in  the  DAG  represents  a  task,  containing  instructions  that 
execute sequentially without preemption. Each task is
assumed to have a computation cost. The edges correspond
to  task  dependencies  (communication  messages  or  prece¬ 
dence  constraints)  and  have  a  communication  cost.  The 
communication-to-computation  ratio  (CCR)  of  a  paral¬ 
lel  program  is  defined  as  the  ratio  between  its  average 
communication  and  computation  costs.  If  two  tasks  are 
scheduled  to  the  same  processor,  the  communication  cost 
between  them  is  assumed  to  be  zero.  The  task  graph 
width  (W)  is  defined  as  the  maximum  number  of  tasks 
that  are  not  connected  through  a  path. 

A  task  with  no  input  edges  is  called  an  entry  task,  while 
a  task  with  no  output  edges  is  called  an  exit  task.  The 
task’s bottom level is defined as the longest path from the
current task to any exit task, where the path length is the
sum  of  the  computation  and  communication  costs  of  the 
tasks  and  edges  belonging  to  the  path.  A  task  is  said  to  be 
ready  if  all  its  parents  have  finished  their  execution.  Note 
that  at  any  given  time  the  number  of  ready  tasks  never 
exceeds  W.  A  task  can  start  its  execution  only  after  all  its 
messages  have  been  received. 
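
The CCR and bottom-level definitions can be made concrete with a small sketch. The task graph, its costs, and the function names below are hypothetical examples; the recursion follows the definition directly (a task's own computation cost plus the most expensive edge-and-successor path to an exit task).

```python
from functools import lru_cache


def bottom_levels(comp_cost, edges):
    """comp_cost: {task: cost}; edges: {(parent, child): comm_cost}."""
    children = {t: [] for t in comp_cost}
    for (u, v), c in edges.items():
        children[u].append((v, c))

    @lru_cache(maxsize=None)
    def blevel(task):
        # Exit tasks have no children, so their tail is 0.
        tail = max((c + blevel(v) for v, c in children[task]), default=0)
        return comp_cost[task] + tail

    return {t: blevel(t) for t in comp_cost}


def ccr(comp_cost, edges):
    """Ratio of average communication cost to average computation cost."""
    avg_comm = sum(edges.values()) / len(edges)
    avg_comp = sum(comp_cost.values()) / len(comp_cost)
    return avg_comm / avg_comp


comp = {"a": 2, "b": 3, "c": 1, "d": 2}
comm = {("a", "b"): 1, ("a", "c"): 4, ("b", "d"): 1, ("c", "d"): 1}
levels = bottom_levels(comp, comm)
ratio = ccr(comp, comm)
```

Here `a` is the entry task and `d` the exit task; the longest path from `a` runs through `c` because of the expensive `a`-to-`c` message.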

As a distributed system we assume a set $\mathcal{P}$ of $P$
processors connected in a clique topology in which
inter-processor communication is assumed to perform
without contention. The processors’ computing speeds differ
and  are  represented  as  fractions  of  the  slowest  processor 
speed. We assume that the task execution time is
proportional to the speed of the processor it is executed
on,  and  consists  of  the  computation  cost  multiplied  by  the 
processor  speed. 

In  our  algorithms,  an  important  concept  is  that  of  the 
enabling processor of a ready task $t$, $EP(t)$, which is the
processor from which the last message arrives. Given a
partial schedule and a ready task $t$, the task is said to be of
type  EP  if  its  last  message  arrival  time  is  greater  than  the 
ready  time  of  its  enabling  processor  and  of  type  non-EP 
otherwise.  Thus,  an  EP  type  task  starts  the  earliest  on  its 
enabling  processor. 
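
The enabling-processor classification can be sketched as follows. The sketch assumes that a message's arrival time is the parent's finish time plus the edge's communication cost, and it applies only to tasks with at least one parent; the function name, the input layout, and the numbers are illustrative.

```python
def enabling_info(parents, proc_ready_time):
    """Classify a ready task as EP or non-EP.

    parents: list of (processor, finish_time, comm_cost), one per parent.
    proc_ready_time: {processor: time at which it becomes free}.
    Assumes arrival time = finish_time + comm_cost (a sketch assumption).
    """
    ep, last_arrival = max(
        ((p, ft + c) for p, ft, c in parents), key=lambda pair: pair[1]
    )
    task_type = "EP" if last_arrival > proc_ready_time[ep] else "non-EP"
    return ep, last_arrival, task_type


parents = [(0, 5, 2), (1, 6, 3)]  # messages arrive at times 7 and 9
idle_case = enabling_info(parents, {0: 4, 1: 8})
busy_case = enabling_info(parents, {0: 4, 1: 12})
```

In `idle_case` processor 1 is free before the last message arrives, so the task is of type EP; in `busy_case` the processor is still busy at that time, so it is non-EP.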

3  The  Algorithms 

List  scheduling  algorithms  use  two  approaches  to  sched¬ 
ule  tasks.  The  first  category  is  the  static  list  schedul¬ 
ing  algorithms  (e.g.,  MCP  [20],  DPS  [11],  HEFT  [17], 
FCP  [13])  that  schedule  the  tasks  in  the  order  of  their  pre¬ 
viously  computed  priorities.  A  task  is  usually  scheduled 
on  the  processor  that  gives  the  earliest  start  time  for  the 
given  task.  Thus,  at  each  scheduling  step,  first  the  task  is 
selected  and  afterwards  its  destination  processor. 

The second approach is dynamic list scheduling
(e.g., ETF [4], ERT [9], FLB [12]). In this case, the tasks
do not have a precomputed priority. At each scheduling
step, each ready task is tentatively scheduled to each processor,
and the best <task, processor> pair is selected
(e.g., the ready task that starts the earliest on the processor
where this earliest start time is obtained for ETF, or
the ready task that finishes the earliest on the processor
where this earliest finish time is obtained for ERT). Thus,
at each step both the task and its destination processor are
selected at the same time.
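
To make the contrast between the two approaches concrete, here is a minimal Python sketch of a generic list scheduler. It is not the code of any of the cited algorithms: it assumes identical processors, uses plain linear scans instead of priority queues, and all names (`list_schedule`, `comp`, `comm`, `priority`) are hypothetical. When a `priority` dictionary is supplied, the task is chosen first and its processor second (static); otherwise the best <task, processor> pair by earliest start time is chosen in one step (dynamic, ETF-like).

```python
def list_schedule(comp, comm, num_procs, priority=None):
    """Schedule a task graph; static if `priority` is given, dynamic otherwise.

    comp: {task: computation cost}; comm: {(parent, child): communication cost}.
    Returns (proc, start, finish) dictionaries keyed by task.
    """
    parents = {t: [] for t in comp}
    children = {t: [] for t in comp}
    for u, v in comm:
        parents[v].append(u)
        children[u].append(v)
    waiting = {t: len(parents[t]) for t in comp}   # unscheduled parents
    ready = [t for t in comp if waiting[t] == 0]   # entry tasks
    idle = [0.0] * num_procs                       # time each processor becomes idle
    proc, start, finish = {}, {}, {}

    def est(task, p):
        # Messages from parents on the same processor cost nothing.
        arrival = max((finish[u] + (0 if proc[u] == p else comm[(u, task)])
                       for u in parents[task]), default=0.0)
        return max(arrival, idle[p])

    while ready:
        if priority is not None:
            # Static: highest-priority task first, then its processor.
            task = max(ready, key=priority.get)
            p = min(range(num_procs), key=lambda q: est(task, q))
        else:
            # Dynamic: best <task, processor> pair in a single step.
            task, p = min(((t, q) for t in ready for q in range(num_procs)),
                          key=lambda pair: est(*pair))
        ready.remove(task)
        start[task] = est(task, p)
        finish[task] = start[task] + comp[task]
        proc[task] = p
        idle[p] = finish[task]
        for child in children[task]:
            waiting[child] -= 1
            if waiting[child] == 0:
                ready.append(child)
    return proc, start, finish


if __name__ == "__main__":
    comp = {"a": 2, "b": 3, "c": 1, "d": 4}
    comm = {("a", "b"): 1, ("a", "c"): 5, ("b", "d"): 2, ("c", "d"): 1}
    print(list_schedule(comp, comm, 2, priority={"a": 13, "b": 9, "c": 6, "d": 4}))
    print(list_schedule(comp, comm, 2))
```

The dynamic branch inspects every ready task on every processor at each step, which is the source of the higher cost that FLB is designed to avoid.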

Both  static  and  dynamic  approaches  of  list  schedul¬ 
ing  have  their  advantages  and  drawbacks  in  terms  of  the 
schedule  quality  they  produce.  Static  approaches  are  more 
suited  for  communication-intensive  and  irregular  prob¬ 
lems,  where  selecting  important  tasks  first  is  more  crucial. 
Dynamic  approaches  are  more  suited  for  computation¬ 
intensive  applications  with  a  high  degree  of  parallelism, 
because  these  algorithms  focus  on  obtaining  a  good  pro¬ 
cessor  utilization. 

FCP  (Fast  Critical  Path)  [13]  and  FLB  (Fast  Load  Bal¬ 
ancing)  [12]  significantly  reduce  the  cost  of  the  static  and 
dynamic  list  scheduling  approaches,  respectively.  In  the 
next  two  sections,  we  describe  both  algorithms  and  we 
outline  the  differences  between  them  and  previous  list 
scheduling  algorithms. 

3.1  FCP 

Static list scheduling algorithms have three important
steps: (a) task priorities computation, which takes at least
O(E + V) time, since the whole task graph has to be
traversed, (b) task selection according to their priorities,
which takes O(V log W) time, and (c) processor selection,
which selects the “best” processor for the previously
selected task, usually the processor where the current
task starts/finishes the earliest. Processor selection
takes O((E + V)P) time, since each task is tentatively
scheduled to each processor. Thus, the highest complexity
steps are the task and processor selection steps, which
determine the O(V log W + (E + V)P) time complexity
of the static list scheduling algorithms.

In  FCP,  the  processor  selection  complexity  is  signifi¬ 
cantly  reduced  by  restricting  the  choice  for  the  destina¬ 
tion  processor  from  all  processors  to  only  two  proces¬ 
sors:  (a)  the  task’s  enabling  processor,  or  (b)  the  processor 
which  becomes  idle  the  earliest.  In  [13]  we  prove  that  the 
start  time  of  a  given  task  is  minimized  by  selecting  one 
of  these  two  destination  processors.  The  proof  is  based 
on  the  fact  that  the  start  time  of  a  task  t  on  a  candidate 
processor  p  is  defined  as  the  maximum  between  (a)  the 
time  the  last  message  to  t  arrives,  and  (b)  the  time  p  be¬ 
comes  idle.  As  the  above-mentioned  processors  minimize 
the  two  components  of  the  task’s  start  time,  respectively, 
it  follows  that  one  of  the  two  processors  minimizes  the 
task’s start time. Consequently, the algorithm’s performance
is not affected, while the time complexity is drastically
reduced from O((E + V)P) to O(V log P + E).
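
The two-candidate argument can be checked on a small example. The sketch below is only an illustration with hypothetical names: `idle` maps each processor to the time it becomes idle, and a message between tasks on the same processor is assumed to cost nothing. For brevity the earliest idle processor is found by a linear scan, whereas the O(V log P + E) bound presupposes a priority queue of processors.

```python
def start_time(task, p, parents, finish, proc, comm, idle):
    """Start time of `task` on processor `p`: max(last message arrival, idle time)."""
    arrival = max((finish[u] + (0 if proc[u] == p else comm[(u, task)])
                   for u in parents[task]), default=0)
    return max(arrival, idle[p])


def select_processor(task, parents, finish, proc, comm, idle):
    """Choose between the enabling processor and the earliest idle processor."""
    candidates = {min(idle, key=idle.get)}            # (b) earliest idle processor
    if parents[task]:
        last = max(parents[task], key=lambda u: finish[u] + comm[(u, task)])
        candidates.add(proc[last])                    # (a) enabling processor
    return min(candidates,
               key=lambda p: (start_time(task, p, parents, finish, proc, comm, idle), p))


if __name__ == "__main__":
    parents = {"t": ["a", "b"]}
    finish = {"a": 5, "b": 4}
    proc = {"a": 0, "b": 1}
    comm = {("a", "t"): 2, ("b", "t"): 6}
    idle = {0: 9, 1: 8, 2: 3}
    best = select_processor("t", parents, finish, proc, comm, idle)
    print(best, start_time("t", best, parents, finish, proc, comm, idle))
```

In this example the earliest idle processor (2) is not the best choice: the last message comes from processor 1, so placing the task there removes the largest communication delay and yields the earliest start.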

The  task  selection  complexity  can  be  reduced  by  main¬ 
taining  only  a  constant  size  sorted  list  of  ready  tasks. 
Thus, we sort as many tasks as fit in the fixed size
sorted list, while the others are stored in an unsorted FIFO
list which has an O(1) access time. The time complexity
of sorting tasks using a list of size H decreases to
O(V log H) as all the tasks are enqueued and dequeued
in the sorted list only once. We have found that for
FCP, which uses bottom level as task priority, a size of
P is required to achieve a performance comparable to
the original list scheduling algorithm (see Section 4). A
sorted list size of P results in a task sorting complexity of
O(V log P).
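
A possible shape for such a partially sorted ready list is sketched below in Python. The class name and the refill policy (moving one task from the FIFO into the sorted part whenever a task is removed) are assumptions of this example; the text only states that a fixed number of tasks is kept sorted and the rest are kept in FIFO order.

```python
import heapq
from collections import deque


class PartiallySortedReadyList:
    """Ready list with a bounded sorted part and an unsorted FIFO overflow."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []        # at most `capacity` entries, highest priority first
        self._fifo = deque()   # overflow, O(1) append and pop

    def push(self, task, priority):
        entry = (-priority, task)   # negate: heapq is a min-heap
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)   # O(log H)
        else:
            self._fifo.append(entry)            # O(1)

    def pop(self):
        """Remove and return the highest-priority task of the sorted part."""
        _, task = heapq.heappop(self._heap)
        if self._fifo:
            # Assumed refill policy: promote the oldest overflow task.
            heapq.heappush(self._heap, self._fifo.popleft())
        return task

    def __len__(self):
        return len(self._heap) + len(self._fifo)


if __name__ == "__main__":
    ready = PartiallySortedReadyList(capacity=2)
    for name, prio in [("a", 5), ("b", 3), ("c", 9), ("d", 1)]:
        ready.push(name, prio)
    print([ready.pop() for _ in range(len(ready))])
```

With a capacity of 2 the tasks leave in the order a, c, b, d rather than the fully sorted order c, a, b, d, which shows the kind of approximation that is traded for the lower sorting cost.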

Using  the  described  techniques  for  task  sorting  and 
processor  selection  the  total  time  complexity  of  FCP 
(O(V log P + E)) is clearly a significant improvement
over  the  time  complexity  of  typical  list  scheduling  ap¬ 
proaches  with  statically  computed  priority. 

3.2  FLB 

In  FLB,  at  each  iteration  of  the  algorithm,  the  ready  task 
that  can  start  the  earliest  is  scheduled  to  the  processor  on 
which  that  start  time  is  achieved.  Note  that  FLB  uses  the 
same  task  selection  criterion  as  in  ETF.  In  contrast  to  ETF 
however,  the  preferred  task  and  its  destination  processor 
are identified in O(log W + log P) time instead of
O(WP).

To  select  the  earliest  starting  task,  pairs  of  a  ready  task 
and  the  processor  on  which  the  task  starts  the  earliest  need 
to  be  considered.  As  shown  earlier,  in  order  to  obtain  the 
earliest  start  time  of  a  ready  task  on  a  partial  schedule, 
the  given  task  must  be  scheduled  either  (a)  to  the  task’s 
enabling  processor,  or  (b)  to  the  processor  becoming  idle 
the  earliest. 

Given a partial schedule, there are only two task-processor
pairs that can achieve the minimum start time for a
task: (a) the EP type task t with the minimum estimated
start time EST(t, EP(t)) on its enabling processor, and
(b) the non-EP type task t' with the minimum last message
arrival time LMT(t') on the processor becoming idle the
earliest. The first case minimizes the earliest start time
of the EP type tasks, while the second case minimizes the
earliest start time of the non-EP type tasks. If in both cases
the same earliest start time is obtained, the non-EP type
task is preferred, because the communication caused by
the messages sent from the task’s predecessors is already
overlapped with the previous computation. Considering
the two cases discussed above guarantees that the ready
task with the earliest start time will be identified. A formal
proof is given in [12].

Figure 1: Miniature task graphs
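
The selection rule described above can be sketched as follows. This is an illustrative Python version with hypothetical names, not the implementation from [12]: each ready task is described by a tuple (LMT, enabling processor, message arrival time on the enabling processor), `prt` maps processors to their ready times, and linear scans stand in for the sorted lists that give FLB its logarithmic selection cost.

```python
def flb_select(ready, prt):
    """Pick the ready task that can start the earliest.

    ready: {task: (lmt, ep, emt_on_ep)}; prt: {processor: ready time}.
    Returns (start_time, task, processor), or None when no task is ready.
    """
    idle_proc = min(prt, key=prt.get)   # processor becoming idle the earliest
    best_ep = None       # (start on enabling processor, task, processor)
    best_non_ep = None   # (lmt, task)
    for task, (lmt, ep, emt_on_ep) in ready.items():
        if ep is not None and lmt > prt[ep]:
            # EP type: candidate (a), started on its enabling processor.
            cand = (max(emt_on_ep, prt[ep]), task, ep)
            if best_ep is None or cand < best_ep:
                best_ep = cand
        else:
            # Non-EP type: candidate (b), ranked by last message arrival time.
            cand = (lmt, task)
            if best_non_ep is None or cand < best_non_ep:
                best_non_ep = cand
    if best_non_ep is not None:
        lmt, task = best_non_ep
        non_ep = (max(lmt, prt[idle_proc]), task, idle_proc)
        # On a tie the non-EP task is preferred, as described in the text.
        if best_ep is None or non_ep[0] <= best_ep[0]:
            return non_ep
    return best_ep


if __name__ == "__main__":
    prt = {0: 10, 1: 4, 2: 6}
    ready = {"x": (12, 0, 8), "y": (5, 2, 3), "z": (9, 1, 7)}
    print(flb_select(ready, prt))
```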

To  reduce  the  complexity  even  further,  the  same 
scheme  as  in  FCP  can  be  used.  Instead  of  maintaining  all 
EP  and  non-EP  tasks  sorted,  only  a  fixed  number  of  tasks 
are stored sorted, while the others are stored in FIFO order.
FLB’s complexity is reduced to O(V log P + E),
while  the  performance  is  maintained  at  a  level  comparable 
to  using  the  fully  sorted  task  lists  (see  Section  4). 

3.3  The  Modifications 

As  mentioned  earlier,  task  scheduling  algorithms  for  het¬ 
erogeneous  systems  perform  better  when  they  sort  tasks 
by  their  finish  time  rather  than  start  time.  The  reason  is 
that  sorting  by  finish  time  implicitly  takes  into  considera¬ 
tion  processor  speeds.  However,  in  order  to  maintain  their 
very  low  complexity,  FCP  and  FLB  must  sort  the  tasks 
according to their start time. As a consequence, the processor
speed is not considered when scheduling a non-EP
task, but only the time the processor becomes idle.

To  overcome  this  deficiency,  we  change  the  priority  cri¬ 
terion  for  processors  for  both  FCP  and  FLB.  Instead  of 
using the time at which the processor becomes idle as its
priority, we now use the sum of the processor idle time and
the mean task execution time. Using this priority scheme,
we are now able to incorporate the processor speed when
selecting the processor for a non-EP task. This is a rough
approximation of finding the processor where a non-EP
type task finishes the earliest.

In  FLB,  we  also  modify  the  task  priority  for  the  EP-type 
tasks.  The  EP-type  tasks  are  sorted  by  their  finish  time  on 
their  enabling  processor  instead  of  their  start  time. 

Finally,  for  both  FCP  and  FLB,  we  change  the  final 
choice  between  the  two  candidate  tasks,  by  selecting  the 
task  finishing  the  earliest  instead  of  the  task  starting  the 
earliest. 
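
A small Python sketch may help to see the effect of these modifications. It assumes, following the system model in Section 2, that the execution time of a task on a processor is its computation cost multiplied by a per-processor factor; the name `exec_factor` for that factor and all function names are hypothetical.

```python
def processor_priority(idle_time, exec_factor, mean_cost):
    """Modified priority: idle time plus mean task execution time on the processor."""
    return idle_time + mean_cost * exec_factor


def pick_processor(idle, exec_factor, mean_cost):
    """Processor for a non-EP task: smallest modified priority."""
    return min(idle, key=lambda p: processor_priority(idle[p], exec_factor[p], mean_cost))


def pick_candidate(candidates, exec_factor):
    """Final choice between candidates (task, processor, start, cost):
    the one finishing the earliest instead of the one starting the earliest."""
    return min(candidates, key=lambda c: c[2] + c[3] * exec_factor[c[1]])


if __name__ == "__main__":
    idle = {0: 10, 1: 8}
    exec_factor = {0: 1.0, 1: 2.0}   # processor 1 is twice as slow
    print(pick_processor(idle, exec_factor, mean_cost=5))
    print(pick_candidate([("t1", 0, 10, 4), ("t2", 1, 9, 4)], exec_factor))
```

In the example, processor 1 becomes idle earlier but is slower, so the modified priority favors processor 0; likewise the candidate that starts later on the faster processor is chosen because it finishes first. Both rules need only constant extra work per decision, in line with the cost claim made in the next paragraph.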

Note that all these modifications of FCP and FLB do
not  involve  any  extra  cost  compared  to  the  original  ver¬ 
sions.  As  a  consequence,  the  cost  of  both  FCP  and  FLB 
is  maintained  at  the  same  very  low  level. 

4  Performance  Results 

The FCP and FLB algorithms are compared with
ERT (Earliest Ready Task) [9] and HEFT (Heterogeneous
Earliest-Finish-Time) [17]. ERT (O(W(E + V)P)) and
HEFT (O(V log W + (E + V)P)) are well-known and
have been shown to obtain competitive results in heterogeneous
systems [9, 17].

For  both  FCP  and  FLB  we  used  two  versions.  The  first 
version  uses  fully  sorted  task  lists.  For  this  first  version, 
FCP  and  FLB  have  exactly  the  same  scheduling  criteria 
as MCP and ETF, respectively. The second version uses
partially sorted priority lists of size P. We call the first
version of the algorithms FCP-f and FLB-f, and the second
FCP-p and FLB-p, respectively.

Figure 2: Cost comparison

We  consider  task  graphs  representing  various  types  of 
parallel  algorithms.  The  selected  problems  are  LU  decom¬ 
position  (“LU”),  Laplace  equation  solver  (“Laplace”)  and 
a  stencil  algorithm  (“Stencil”).  For  each  of  these  prob¬ 
lems,  we  adjusted  the  problem  size  to  obtain  task  graphs 
of  about  2000  nodes.  For  each  problem,  we  varied  the 
task  graph  granularities,  by  varying  the  communication- 
to-computation ratio (CCR). The values used for CCR
are 0.2 and 5.0. For each problem and each CCR value,
we generated 5 graphs with random execution times and
communication delays (i.i.d. uniform distribution with
unit coefficient of variation), the results being the average
over the 5 graphs (in view of the low overall variance,
5 samples are sufficient). Miniature task graph samples
of each type are shown in Figure 1.

We schedule the task graphs on 2, 4, 8, 16 and 32
processors. For each P, we use 10 heterogeneous configurations
in which the processors’ speeds are uniformly
distributed over the following intervals: [8,12], [6,14]
and [4,16]. Thus, the total number of test configurations
is 3 (problems) x 2 (CCR) x 5 (sample graphs) x
5 (numbers of processors) x 10 (processor configurations) x
3 (processor intervals) = 4500.
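
The experimental grid can be enumerated directly, which also serves as a check on the count: the six factors listed above multiply to 4500. The Python sketch below is only an illustration; the constant names and the helper `sample_speeds` are hypothetical, and drawing each processor speed independently with `random.uniform` is an assumption about how "uniformly distributed" speeds might be generated.

```python
import itertools
import random

PROBLEMS = ("LU", "Laplace", "Stencil")
CCRS = (0.2, 5.0)
SAMPLE_GRAPHS = range(5)
PROCESSOR_COUNTS = (2, 4, 8, 16, 32)
CONFIGURATIONS = range(10)
SPEED_INTERVALS = ((8, 12), (6, 14), (4, 16))


def sample_speeds(num_procs, interval, rng):
    """Draw one heterogeneous configuration: one speed per processor."""
    low, high = interval
    return [rng.uniform(low, high) for _ in range(num_procs)]


# One entry per test configuration: 3 * 2 * 5 * 5 * 10 * 3 combinations.
grid = list(itertools.product(PROBLEMS, CCRS, SAMPLE_GRAPHS,
                              PROCESSOR_COUNTS, CONFIGURATIONS, SPEED_INTERVALS))

if __name__ == "__main__":
    print(len(grid))
    print(sample_speeds(4, (4, 16), random.Random(0)))
```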

4.1  Running  Times 

In Fig. 2 the average running time of the algorithms
is shown in CPU seconds as measured on a Pentium
Pro/300MHz PC with 64 MB RAM running Linux 2.0.32.
ERT is the most costly among the compared algorithms.
Its cost increases from 72 ms for 2 processors up to 11 s
for 64 processors (we do not include ERT’s running times
for P > 16 in Figure 2 because they are much higher).
HEFT’s cost also increases with the number of processors,
but it is significantly lower. For P = 2, it runs
for 17 ms, while for P = 64, the running time is 279 ms.

Both versions of FCP and FLB have considerably
lower running times. FCP-p’s running time is the lowest,
varying from 16 ms for P = 2 to 25 ms for P = 64.
FCP-f varies from 21 ms for P = 2 to 24 ms for P = 64.
One can note that for larger numbers of processors both
versions of FCP have the same running times. The reason
is that the ready tasks fit in the sorted part of FCP-p’s
priority list.

FLB  has  a  slightly  higher  cost  compared  to  FCP,  be¬ 
cause  of  the  more  complicated  task  and  processor  selec¬ 
tion  schemes.  The  running  times  vary  around  26  ms  and 
24  ms  for  FLB-f  and  FLB-p,  respectively.  Their  running 
times do not vary significantly with the number of processors.
One can note that for larger numbers of processors,
FCP and FLB’s running times tend to become similar.

4.2  Scheduling  Performance 

In  this  section  we  study  how  the  FCP  and  FLB  algorithms 
perform.  We  first  compare  FCP  and  FLB’s  performance 
to  ERT  and  HEFT’s  performance,  with  respect  to  gran¬ 
ularity,  problem  type  and  processor  heterogeneity.  Next, 
we  show  the  speedups  achieved  by  FCP  and  FLB. 
Figure 3: Performance comparison with respect to the problem


For  performance  comparison,  we  use  the  normalized 
schedule length (NSL), defined as the ratio between the
schedule  length  of  the  given  algorithm  and  the  schedule 
length  of  ERT. 
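
Written out, the metric is a single division; the short Python sketch below (function and argument names are hypothetical) makes the direction of the ratio explicit: values below 1 mean the schedule is shorter than ERT's, values above 1 mean it is longer.

```python
def normalized_schedule_length(schedule_length, ert_schedule_length):
    """NSL: schedule length of an algorithm divided by ERT's schedule length."""
    if ert_schedule_length <= 0:
        raise ValueError("reference schedule length must be positive")
    return schedule_length / ert_schedule_length


if __name__ == "__main__":
    print(normalized_schedule_length(110.0, 100.0))  # longer than ERT
    print(normalized_schedule_length(90.0, 100.0))   # shorter than ERT
```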

In  Figure  3  we  study  the  algorithms’  performance  with 
respect  to  the  problem  type  by  comparing  the  schedule 
lengths  averaged  over  the  three  processor  speed  intervals. 
One  can  note  that  for  both  FCP  and  FLB,  the  partial 
versions  obtain  performance  similar  to  the  full  versions. 
Therefore  we  will  further  refer  only  to  the  partial  versions 
of  FCP  and  FLB. 

One can note that the overall performance of FCP is
comparable to ERT’s performance, although at a much
lower cost. For problems involving a large number of fork
and join tasks, such as LU and Laplace, for a large number
of processors ERT performs better, by up to 16% for both
coarse and fine-grain cases (Laplace, P = 32). For all
the other cases (i.e., for regular problems, such as Stencil,
or for small numbers of processors) FCP performs as well as
or better than ERT, by up to 8% (Stencil, P = 32)
and 7% (LU, P = 16) for coarse and fine-grain problems,
respectively.

Compared to HEFT, FCP is outperformed for problems
involving a large number of fork and join tasks, such as
LU and Laplace, for a large number of processors, by
up to 27% (Laplace, P = 32) and 23% (LU, P = 32)
for coarse and fine-grain cases, respectively. However,
in all the other cases (i.e., for regular problems, such as
Stencil, or for small numbers of processors) FCP performs
comparably to HEFT.

FLB’s performance is generally worse, being outperformed
by ERT, HEFT and FCP by up to 46%, 57%, and
30% (all for coarse-grain Laplace, P = 32), respectively.
However, even for FLB, the performance becomes
comparable to the other three algorithms for regular problems,
such as Stencil, or small numbers of processors.

In Figure 4 we study the influence of heterogeneity
on performance. The results are averaged over the
LU,  Laplace  and  Stencil  problems.  Again,  both  FCP  and 
FLB  obtain  similar  performance  for  the  full  and  partial 
versions. 

Again, the overall performance of FCP is comparable
to ERT’s performance. For a large processor speed variance
(i.e., 4-16) and for a large number of processors
ERT performs better, by up to 15% and 12% for coarse and
fine-grain cases (4-16 processor speed range, P = 32),
respectively. For all the other cases (i.e., small processor
speed variance, or for small numbers of processors) FCP
performs as well as or even better than ERT, by up to 8%
and 12% (both for Stencil, P = 16) for coarse and fine-grain
problems, respectively.

Figure 4: Performance comparison with respect to heterogeneity

Compared to HEFT, FCP is also outperformed for a
large processor speed variance and for a large number of
processors, by up to 28% and 26% (both for 4-16 processor
speed range, P = 32) for coarse and fine-grain
cases, respectively. However, for small processor speed
variance, or for small numbers of processors, FCP’s performance
tends to become comparable to HEFT’s.

FLB’s performance is generally worse, being outperformed
by ERT, HEFT and FCP by up to 50%, 63%, and
35% (all for 4-16 processor speed range, coarse-grain
problems, P = 32), respectively. However, even for
FLB, the performance becomes comparable to the other
three algorithms for regular problems, such as Stencil, or
small numbers of processors.

One can note that for heterogeneous systems, the versions
using fully and partially sorted priority lists perform
comparably for both FCP and FLB. Similar to homogeneous
systems, a partially sorted list of size P yields competitive
results, while the scheduling complexity becomes
extremely low: O(V log P + E).

Figures 5 and 6 show the speedups achieved for the
FCP and FLB algorithms respectively. Although FCP performs
better, the two algorithms perform similarly with respect
to problem type, granularity and processor speed
range. For Stencil the speedup is almost linear. However,
for LU and Laplace the speedup starts leveling off
for more than 32 processors. The reason is that LU
and Laplace have a large number of fork and join nodes,
and as a consequence a limited parallelism, while Stencil
is a regular problem with a large and constant parallelism.
Also, one can note that for a large processor speed
variance (i.e., 4-16) and a large number of processors
(P = 32) the speedup is lower compared to a small processor
speed variance. Also, for fine-grain problems, the
speedup is lower for a large number of processors. In
both cases the reason is that there are not enough tasks
to fully utilize the existing processors, and, as FCP and
FLB are not specifically designed for heterogeneous processors,
they do not always select the faster processors
first.


5  Conclusion 

In  this  paper  we  investigate  the  performance  of  the  low- 
cost  static  list  scheduling  algorithm  FCP  and  dynamic  list 
scheduling  algorithm  FLB,  modified  to  schedule  applica¬ 
tions  for  heterogeneous  systems.  We  show  that  making 
minimal  modifications  that  do  not  affect  their  very  low 
cost,  FCP  and  FLB  still  obtain  good  performance  in  het¬ 
erogeneous  systems,  at  a  cost  that  is  considerably  below 
typical  scheduling  algorithms  for  heterogeneous  systems. 

We  show  that  the  performance  of  the  modified  versions 
of  FCP  and  FLB  is  generally  comparable  to  algorithms 
specifically  designed  for  heterogeneous  systems,  such  as 
HEFT and ERT. There are only a few cases, mainly for
irregular problems and large processor speed variance,
where FCP and FLB are outperformed by up to 32%
and 63%, respectively.

Considering  the  overall  performance  and  their  very  low 
cost  compared  to  the  other  algorithms,  we  believe  FCP 
and  FLB  to  be  interesting  compile-time  candidates  for 
heterogeneous  systems,  especially  considering  the  large 
problem  sizes  that  are  used  in  practice. 

References 

[1]  I.  Ahmad  and  Y.-K.  Kwok.  A  new  approach  to 
scheduling  parallel  programs  using  task  duplication. 
In  Proc.  Int’l  Conf.  on  Parallel  Processing,  1994. 

[2]  Y.  C.  Chung  and  S.  Ranka.  Application  and  perfor¬ 
mance  analysis  of  a  compile-time  optimization  ap¬ 
proach  for  list  scheduling  algorithms  on  distributed- 
memory  multiprocessors.  In  Proc.  Supercomputing, 
1992. 


[3]  R.  L.  Graham.  Bounds  on  multiprocessing  timing 
anomalies.  SIAM  Journal  on  Applied  Mathematics, 
17(2):416-429, Mar. 1969.

[4]  J.-J.  Hwang,  Y.-C.  Chow,  F.  D.  Anger,  and  C.-Y. 
Lee.  Scheduling  precedence  graphs  in  systems  with 
interprocessor  communication  times.  SIAM  Journal 
on  Computing,  18:244-257,  Apr.  1989. 

[5]  M.  Kafil  and  I.  Ahmad.  Optimal  task  assignment  in 
heterogeneous  computing  systems.  In  Proc.  Hetero¬ 
geneous  Computing  Workshop,  1997. 

[6]  A.  A.  Khan,  C.  L.  McCreary,  and  M.  S.  Jones.  A 
comparison  of  multiprocessor  scheduling  heuristics. 
In  Proc.  Int’l  Conf.  on  Parallel  Processing,  1994. 

[7]  B.  Kruatrachue  and  T.  G.  Lewis.  Grain  size  determi¬ 
nation  for  parallel  processing.  IEEE  Software,  pages 
23-32,  Jan.  1988. 

[8]  Y.-K.  Kwok  and  I.  Ahmad.  Benchmarking  the  task 
graph  scheduling  algorithms.  In  Proc.  Int’l  Paral¬ 
lel  Processing  Symp.  /  Symp.  on  Parallel  and  Dis¬ 
tributed  Processing,  1998. 

[9]  C.-Y.  Lee,  J.-J.  Hwang,  Y.-C.  Chow,  and  F.  D. 
Anger.  Multiprocessor  scheduling  with  interpro¬ 
cessor  communication  delays.  Operations  Research 
Letters,  7:141-147,  June  1988. 

[10]  M.  Maheswaran  and  H.  J.  Siegel.  A  dynamic 
matching  and  scheduling  algorithm  for  heteroge¬ 
neous  computing  systems.  In  Proc.  Heterogeneous 
Computing  Workshop,  1998. 

[11]  G.-L.  Park,  B.  Shirazi,  J.  Marquis,  and  H.  Choo. 
Decisive  path  scheduling:  A  new  list  scheduling 
method. In Proc. Int’l Conf. on Parallel Processing,
1997. 

[12]  A.  Radulescu  and  A.  J.  C.  van  Gemund.  FLB:  Fast 
load  balancing  for  distributed-memory  machines.  In 
Proc.  Int’l  Conf.  on  Parallel  Processing,  1999. 

[13]  A.  Radulescu  and  A.  J.  C.  van  Gemund.  On  the  com¬ 
plexity  of  list  scheduling  algorithms  for  distributed- 
memory  systems.  In  Proc.  ACM  Int’l  Conf.  on  Su¬ 
percomputing,  1999. 




[14]  A.  Radulescu,  A.  J.  C.  van  Gemund,  and  H.-X.  Lin. 
LLB:  A  fast  and  effective  scheduling  algorithm  for 
distributed-memory systems. In Proc. Int’l Parallel
Processing Symp. / Symp. on Parallel and Distributed
Processing, pages 525-530, 1999.

[15]  V.  Sarkar.  Partitioning  and  Scheduling  Parallel  Pro¬ 
grams  for  Execution  on  Multiprocessors.  PhD  the¬ 
sis,  MIT,  1989. 

[16]  M.  Tan,  H.  J.  Siegel,  J.  K.  Antonio,  and  Y.  A.  Li. 
Minimizing  the  application  execution  time  through 
scheduling  of  subtasks  and  communication  traffic  in 
a heterogeneous computing system. IEEE Trans.
on Parallel and Distributed Systems, 8(8):857-871,
Aug. 1997.

[17]  H.  Topcuoglu,  S.  Hariri,  and  M.-Y.  Wu.  Task 
scheduling  algorithms  for  heterogeneous  proces¬ 
sors.  In  Proc.  Heterogeneous  Computing  Workshop, 
1999. 

[18]  L.  Wang,  H.  J.  Siegel,  V.  P.  Roychowdhury,  and 
A.  A.  Maciejewski.  Task  matching  and  scheduling 
in  heterogeneous  computing  environments  using  a 
genetic-algorithm-based approach. Journal of Parallel
and Distributed Computing, 47:8-22, 1997.

[19] C. M. Woodside and G. G. Monforton. Fast allocation
of processes in distributed and parallel systems.
IEEE Trans. on Parallel and Distributed Systems,
4(2):164-174, Feb. 1993.

[20] M.-Y. Wu and D. D. Gajski. Hypertool: A programming
aid for message-passing systems. IEEE Trans.
on Parallel and Distributed Systems, 1(7):330-343,
July 1990.

[21]  M.-Y.  Wu,  W.  Shu,  and  J.  Gu.  Local  search  for  dag 
scheduling  and  task  assignment.  In  Proc.  Int’l  Conf. 
on  Parallel  Processing,  1997. 

[22]  T.  Yang  and  A.  Gerasoulis.  Pyrros:  Static  task 
scheduling  and  code  generation  for  message  pass¬ 
ing  multiprocessors.  In  Proc.  ACM  Int’l  Conf.  on 
Supercomputing,  1992. 


Biographies 

Andrei Radulescu received an MSc degree in Computer
Science in 1995 from “Politehnica” University of
Bucharest. Between 1995 and 1997 he was a teaching assistant
at the “Politehnica” University of Bucharest. Since
1997, he has been a PhD student at the Department of Information
Technology and Systems of Delft University of
Technology. His research interests are in multiprocessor
scheduling, software support for parallel computing and
parallel and distributed systems programming.

Arjan J.C. van Gemund received a BSc in Physics in
1981, an MSc degree (cum laude) in Computer Science
in 1989, and a PhD (cum laude) in 1996, all from Delft
University of Technology. In 1981 he joined the R&D
organization of a Dutch multinational company as an
Electrical Engineer and Systems Programmer. Between
1989 and 1992 he worked at the Dutch TNO research organization
as a Research Scientist specialized in the field of
high-performance computing. Since 1992, he has worked at
the Department of Information Technology and Systems
of Delft University of Technology, currently as Associate
Professor. His research interests are in the area of parallel
and distributed systems programming, scheduling, and
performance modeling.




SESSION  4-A 
GRID  APPLICATIONS 


Chair:  I.  Pramanick,  Sun  Microsystems,  USA 


Combining  Workstations  and  Supercomputers  to  Support  Grid  Applications: 

The  Parallel  Tomography  Experience 

Shava Smallen*  Walfredo Cirne*  Jaime Frey†  Francine Berman*  Rich Wolski‡
Mei-Hui Su§  Carl Kesselman§  Steve Young¶  Mark Ellisman¶

* Computer Science and Engineering Department
University of California, San Diego
[ssmallen, walfredo, berman]@cs.ucsd.edu

† Department of Computer Science
University of Wisconsin
jfrey@cs.wisc.edu

‡ Department of Computer Science
University of Tennessee
rich@cs.utk.edu

§ Information Sciences Institute
University of Southern California
[mei, carl]@isi.edu

¶ National Center for Microscopy and Imaging Research
University of California, San Diego
mark@ncmir.ucsd.edu


Abstract 

Computational Grids are becoming an increasingly important
and powerful platform for the execution of large-scale,
resource-intensive applications. However, it remains
a challenge for applications to tap into the potential of Grid
resources in order to achieve performance. In this paper, we
illustrate  how  work  queue  applications  can  leverage  Grids 
to  achieve  performance  through  coallocation.  We  describe 
our  experiences  developing  a  scheduling  strategy  for  a  pro¬ 
duction  tomography  application  targeted  to  Grids  that  con¬ 
tain  both  workstations  and  parallel  supercomputers. 

Our  strategy  uses  dynamic  information  exported  by  a 
supercomputer’s  batch  scheduler  to  simultaneously  sched¬ 
ule  tasks  on  workstations  and  immediately  available  super¬ 
computer  nodes.  This  strategy  is  of  great  practical  inter¬ 
est  because  it  combines  resources  available  to  the  typical 
research  lab:  time-shared  workstations  and  CPU  time  in 
remote  space-shared  supercomputers.  We  show  that  this 
strategy  improves  the  performance  of  the  tomography  appli¬ 
cation  compared  to  traditional  scheduling  strategies,  which 
target  the  application  to  either  type  of  resource  alone. 
9807479, NIH/NCR grants RR04050 and RR08605, CAPES grant


1.  Introduction 

The aggregation of heterogeneous resources into a Computational
Grid [9] provides a powerful platform for the execution
of large-scale resource-intensive applications. The
simultaneous  use  of  heterogeneous  resources  can  greatly 
improve  the  performance  of  many  applications,  and  permits 
researchers  to  run  applications  at  the  very  large  problem 
sizes  critical  to  the  discovery  of  new  results.  Although  we 
are  gaining  considerable  experience  in  the  development  of 
infrastructures  which  integrate  distributed,  heterogeneous 
resources,  we  have  less  experience  developing  applications 
which  can  leverage  the  distributed  resources  of  the  Grid  to 
improve  performance. 

One  application  which  has  profited  from  leveraging  the 
processing  power  of  the  Computational  Grid  is  the  Paral¬ 
lel  Tomography  (GTOMO)  application  being  used  in  pro¬ 
duction  at  the  National  Center  for  Microscopy  and  Imaging 
Research  (NCMIR).  GTOMO  is  an  embarrassingly-parallel 
application  implemented  with  a  work  queue  scheduling 
strategy.  It  uses  Globus  [10]  services  to  perform  a  3-D  re¬ 
construction  from  a  series  of  images  produced  by  NCMIR’s 
electron  microscope.  As  is  the  case  with  many  laborato¬ 
ries,  NCMIR  owns  a  limited  number  of  workstations  (which 
are  used  as  desktop  machines  and  as  a  platform  for  parallel 
processing)  and  has  access  to  supercomputer  time.  In  this 
paper, we describe a coallocation strategy for using both
supercomputers and interactive workstation clusters to improve
the execution performance of GTOMO within the context
of a typical lab environment.

The  scheduling  strategy  for  GTOMO  works  at  the 
application-level  to  target  the  application  to  both  interactive 
workstation  clusters  and  supercomputers.  In  an  interactive 
workstation  cluster,  typically  a  time-shared  computational 
platform,  jobs  begin  execution  immediately  but  share  the 
CPU  and  network  with  other  competing  processes.  In  con¬ 
trast,  job  submissions  to  a  supercomputer,  typically  a  space- 
shared  computational  platform,  must  wait  in  a  batch  queue 
until  the  desired  number  of  the  machine’s  processors  be¬ 
come  available  for  dedicated  use.  The  time  an  application 
spends waiting in the queue impacts its turnaround time,
the  time  elapsed  from  the  submission  of  the  application  by 
the  user  until  all  of  the  results  are  available.  Because  the 
queue  wait  time  can  be  quite  lengthy  [20],  an  application’s 
turnaround time can be relatively large compared to its execution
time.¹ Furthermore, the queue wait times make it difficult
to use supercomputers and workstations concurrently,
a  strategy  that  could  increase  the  processing  power  avail¬ 
able  to  an  application.  Our  strategy  avoids  unpredictable 
queue  time  delays  by  adaptively  submitting  requests  to  the 
supercomputer  that  can  start  running  immediately. 

The  adaptive  scheduler  developed  for  GTOMO  is  framed 
as  an  AppLeS  [2].  An  AppLeS  application  scheduler  inte¬ 
grates  with  the  target  application  to  develop  a  schedule  for 
deploying  the  application  in  a  shared,  dynamic  Grid  envi¬ 
ronment  [3, 23,  22].  The  scheduler  makes  predictions  of  the 
performance  the  application  may  experience  on  prospec¬ 
tive  resources  at  execution  time.  Using  these  predictions, 
a  potentially  performance-efficient  schedule  for  the  appli¬ 
cation  is  identified  and  deployed.  We  developed  a  simple 
and  effective  coallocation  strategy  for  the  GTOMO  AppLeS 
which  targets  both  supercomputers  and  interactive  worksta¬ 
tions.  Our  experiments  show  that  the  GTOMO  AppLeS 
coallocation  strategy  improves  the  turnaround  time  of  the 
application  over  strategies  which  target  either  interactive 
workstations  alone  or  a  parallel  supercomputer  alone.  We 
believe  that  the  GTOMO  AppLeS  coallocation  strategy  will 
be  effective  for  other  work  queue  applications  as  well. 

The  next  section  provides  a  brief  description  of 
GTOMO.  Section  3  describes  our  coallocation  strategy  for 
scheduling  GTOMO  over  workstations  and  supercomput¬ 
ers.  Section  4  presents  the  results  of  comparing  our  strategy 
against  other  scheduling  alternatives.  Section  5  discusses 
related  work.  Section  6  concludes  the  paper  and  discusses 
future  work. 


1. In practice, queue times may range from seconds to days.


2.  GTOMO  Structure 

Tomography  allows  for  the  reconstruction  of  the  3-D 
structure  of  an  object  based  on  2-D  projections  through  it 
taken  at  different  angles.  Electron  microscopy  is  a  classical 
use  for  tomography.  Biological  specimens  on  the  cellular 
and  sub-cellular  level  are  viewed  with  an  electron  micro¬ 
scope  and  their  images  are  recorded  at  a  number  of  differ¬ 
ent  angles.  These  images  are  then  aligned  and  reconstructed 
into  3-D  volumes  using  analytic  and  iterative  tomographic 
techniques  [18]. 

Reconstructing  a  typically  sized  volume  using  a  simple 
algorithm  (filtered  back-projection)  currently  takes  several 
hours  on  a  workstation.  NCMIR  researchers  have  been  in¬ 
terested  in  increasing  the  computation  speed  of  the  recon¬ 
struction  for  two  reasons.  First,  they  want  to  make  use 
of  more  elaborate  tomographic  algorithms,  which  produce 
more  refined  3-D  volumes.  These  algorithms  are  more  com¬ 
putationally  intensive  than  the  algorithms  currently  used. 
Second,  NCMIR  is  interested  in  on-line  tomography  where 
the  volume  is  rendered  while  the  biologist  is  still  collecting 
data  on  the  microscope.  This  provides  immediate  feedback 
about  the  specimen  being  viewed  and  thus  may  prompt  the 
researcher  to  change  the  experiment  as  a  whole,  or  just  some 
parameters  of  it  (e.g.,  orientation  and/or  number  of  projec¬ 
tions).  For  this  to  be  useful,  a  rough  reconstruction  would 
have  to  finish  in  5  to  10  minutes.  No  single  processor  can 
achieve  this  presently,  which  led  the  NCMIR  researchers  to 
explore  parallelism. 

The  tomography  application  is  highly  amenable  to  par¬ 
allelism.  Because  specimens  are  only  rotated  about  a  single 
axis  as  images  are  acquired,  for  any  slice  orthogonal  to  the 
axis  of  rotation,  all  information  for  that  slice  falls  onto  a  sin¬ 
gle  line  on  each  of  the  projections  (see  Figure  1).  More  im¬ 
portantly,  any  such  slice  can  be  reconstructed  independently 
of  projection  information  for  the  rest  of  the  volume.  This 
makes the reconstruction embarrassingly parallel. Therefore,
the tomographic reconstruction can naturally be
implemented  as  a  work  queue.  In  our  implementation,  we 
use  Globus  services  to  support  efficient  application  execu¬ 
tion  within  a  heterogeneous,  distributed  environment. 

The  structure  of  GTOMO  is  depicted  in  Figure  2.  There 
are  four  types  of  application  processes:  driver,  reader, 
writer, and ptomo. The driver controls the work queue:
it  assigns  one  work  unit  or  slice  to  a  free  ptomo  until  no 
more  slices  remain.  The  driver  is  invoked  by  the  user  and 
starts  up  the  other  processes.  The  reader  and  writer  are  I/O 
processes  and  hence  have  direct  access  to  the  user  file  sys¬ 
tem.  The  reader  reads  input  files  off  the  disk  and  sends  them 
to  the  ptomos  for  processing.  The  writer  receives  output 
files  from  ptomos  and  writes  them  to  disk.  Note  that  the 
reader  and  writer  enable  GTOMO  to  run  across  different 
file system domains. The ptomo receives input files from a
reader, does all the computational work, and sends output
to a writer. In this study, we use one reader, one writer, and
any number of ptomos. Due to the multi-threaded nature of
Globus’ Nexus communications library, one reader can service
I/O requests for many ptomos simultaneously, and the
same applies to the writer.

Figure 1. Projection geometry relating to a single-axis tilting experiment (from [12])
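The on-demand assignment performed by the driver can be illustrated with a small sketch. This is not the GTOMO code: the function names are hypothetical, Python threads stand in for remote ptomo processes, a shared dictionary stands in for the writer, and `reconstruct` is a placeholder for the per-slice tomographic computation.

```python
import queue
import threading


def reconstruct(slice_id):
    # Hypothetical placeholder for reconstructing one slice.
    return slice_id * slice_id


def run_work_queue(num_slices, num_ptomos):
    """Hand out one slice at a time to whichever worker is free."""
    work = queue.Queue()
    for s in range(num_slices):
        work.put(s)

    results = {}
    lock = threading.Lock()

    def ptomo():
        while True:
            try:
                s = work.get_nowait()  # request the next unit of work
            except queue.Empty:
                return  # no slices remain
            out = reconstruct(s)
            with lock:  # stands in for sending output to the writer
                results[s] = out

    workers = [threading.Thread(target=ptomo) for _ in range(num_ptomos)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results
```

Because each slice is independent, faster workers simply return for more work sooner, which is what lets a work queue balance load across heterogeneous machines without an explicit performance model.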

3.  Scheduling  GTOMO 

Generally speaking, the set of potential resources available
to GTOMO consists of workstations $w_1, \ldots, w_\omega$ and
supercomputers $s_1, \ldots, s_\sigma$. A request to run a process p on
workstation  w  causes  p  to  start  immediately,  but  p  time- 
shares  w  with  other  processes.  To  use  a  supercomputer 
s,  one  has  to  specify  how  many  processors  n  will  execute 
copies of p and for how long t. The n copies of p do not
necessarily  start  immediately;  they  might  wait  in  the  queue 
for  an  indeterminate  amount  of  time  until  n  nodes  become 
available  for  t  seconds.  However,  supercomputer  processes 
run  over  dedicated  resources  once  they  are  acquired. 

Scheduling  a  GTOMO  job  consists  of  (i)  choosing  the 
requests  to  send  to  both  supercomputers  and  workstations, 
and  (ii)  assigning  work  for  the  ptomos.  For  (ii),  we  use  the 
work  queue  strategy  shown  in  Figure  2  that  assigns  work  on 
demand. For (i), we have to determine performance-efficient
values of n and t for each available supercomputer
s. Our goal is to select n and t in a way that minimizes
GTOMO’s turnaround time. Note that the difficulty of predicting
supercomputer queue wait times makes it hard to
find an optimal n and t [8, 20, 14]. We avoid the queue
time  prediction  problem  by  using  supercomputer  nodes  that 
are  immediately  available.  Therefore,  we  minimize  the 
turnaround  time  of  GTOMO  by  scheduling  its  execution  at 
once  on  workstations  and  any  immediately  available  super¬ 
computer  nodes. 

We  assume  that  the  supercomputer  scheduler  can  provide 
us  with  the  maximum  values  of  n  and  t  for  which  execution 
can  begin  immediately.  In  our  implementation,  this  infor¬ 
mation  is  supplied  by  the  showbf  command  provided  with 
the  Maui  Scheduler  [15],  a  scheduler  available  for  the  IBM 
SP2. The showbf command returns a set of backfill windows,
$b_1, \ldots, b_\beta$. Each $b_i = (n, t)$ indicates that n nodes are available
for immediate execution for the next t seconds.

The  GTOMO  AppLeS  scheduler  uses  the  following  al¬ 
gorithm  to  schedule  the  ptomos: 

    for i = 1 to σ:
        b = showbf(s_i)
        for j = 1 to β:
            start b_j.n ptomos on s_i for time b_j.t
    for i = 1 to ω:
        start ptomo on w_i

Therefore, if backfill windows are available on any of the
supercomputers, the job will be coallocated on those idle
supercomputer nodes and workstations. If no backfill windows
are returned by any of the supercomputers, the job
will run only on workstations. The reader is scheduled on
the machine where the input data is located and the writer
is scheduled on the machine where the output data will be
placed.

Figure 2. Application components of GTOMO. Solid lines represent transfer of input and output.
Dotted lines denote control connections.
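The coallocation algorithm above can be sketched as follows. This is an illustration under stated assumptions, not the AppLeS implementation: `schedule` and `fake_showbf` are hypothetical names, and `fake_showbf` is a stub returning made-up backfill windows rather than a call to the Maui Scheduler's showbf command.

```python
def schedule(supercomputers, workstations, showbf):
    """Build the list of start requests as (resource, ptomo_count, time_limit).

    `showbf` is any callable that maps a supercomputer name to a list of
    backfill windows, each a (nodes, seconds) pair.
    """
    requests = []
    for s in supercomputers:
        for n, t in showbf(s):
            # One request per backfill window: n ptomos for at most t seconds.
            requests.append((s, n, t))
    for w in workstations:
        # Workstations start immediately and have no time limit.
        requests.append((w, 1, None))
    return requests


def fake_showbf(name):
    # Hypothetical data: two windows on "sp2", none anywhere else.
    windows = {"sp2": [(4, 1800), (2, 7200)]}
    return windows.get(name, [])
```

For example, `schedule(["sp2"], ["ws1", "ws2"], fake_showbf)` yields two supercomputer requests followed by one request per workstation, and a supercomputer with no backfill windows contributes nothing, so the job falls back to workstations only.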

Note  that  the  nodes  immediately  available  in  the  SP2 
may  not  be  available  for  the  full  duration  of  the  application. 
Therefore,  the  GTOMO  AppLeS  scheduler  has  to  cope  with 
ptomo  processes  that  detach  themselves  from  the  applica¬ 
tion  before  execution  has  completed.  We  have  added  a  fault 
recovery  mechanism  to  GTOMO,  which  enables  us  to  treat 
this  problem  as  a  ptomo  failure.  Whenever  a  ptomo  fails, 
the  slice  it  was  processing  is  returned  to  the  work  queue. 
We  can  use  such  a  simple  scheme  because  processing  a  slice 
has  no  side  effects.  The  advantage  of  reducing  this  problem 
to  fault  recovery  is,  of  course,  that  it  also  covers  real  faults. 
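The recovery scheme can be illustrated with a deterministic simulation. The sketch below is hypothetical and greatly simplified: each ptomo is given a budget of slices it completes before its backfill window expires, and a ptomo that expires mid-slice is treated as failed, so its slice goes back on the queue.

```python
from collections import deque


def run_with_expiry(num_slices, budgets):
    """Simulate a work queue whose workers may expire.

    budgets[i] is the number of slices ptomo i completes before expiring
    (None means it never expires, as for a workstation).
    Returns the sorted list of completed slices and the requeue count.
    """
    pending = deque(range(num_slices))
    done = []
    requeued = 0
    alive = dict(enumerate(budgets))
    while pending:
        if not alive:
            raise RuntimeError("all ptomos expired before the work finished")
        for pid in list(alive):
            if not pending:
                break
            s = pending.popleft()
            if alive[pid] == 0:
                # Expiry is handled like a failure: return the slice.
                pending.append(s)
                requeued += 1
                del alive[pid]
            else:
                done.append(s)
                if alive[pid] is not None:
                    alive[pid] -= 1
    return sorted(done), requeued
```

Requeueing is safe here only because, as noted above, processing a slice has no side effects, so a slice can be reprocessed from scratch by another ptomo.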

4.  Experimental  Results 

We denote the GTOMO AppLeS scheduling strategy as
SP2Immed/WS since it adaptively combines both the immediately
available SP2 nodes and workstations. In order
to ascertain how this strategy performs, we compared it
against other possible scheduling strategies: using workstations
only (WS), using only the nodes that are immediately
available in the SP2 (SP2Immed), and requesting a predetermined
number of nodes in the SP2 and probably waiting for
them in the queue (SP2Queue). WS and SP2Queue, respectively,
are the standard ways to use a cluster of workstations
and a parallel supercomputer.

We  ran  experiments  on  a  cluster  of  7  workstations  avail¬ 
able  in  the  Parallel  Computation  Laboratory  (PCL)  at  U.C. 
San  Diego  and  on  the  San  Diego  Supercomputer  Center’s 
SP2, one of the supercomputers available to NCMIR scientists.
The PCL workstation cluster includes one 200 MHz
UltraSPARC 2, a 110 MHz Sparc 5, an 85 MHz Sparc 5, and
four 400 MHz Pentium IIs. The workstations are connected
by a mixture of 10 and 100 Mbit/s ethernet subnets. The SP2
has 128 thin node POWER2 processors running at 160 MHz
where processor pairs are interconnected by a 110 MB/s bidirectional
network [21]. Other users were present on all
resources  during  the  experiments.  Our  dataset  consisted  of 
300  slices;  each  input  slice  was  238  KB  and  the  output  slice 
was  1.2  MB. 

We  note  that  it  is  problematic  to  design  experiments 
which  compare  multiple  scheduling  strategies  under  the 
same  load  and  queue  conditions  for  multi-user  production 
environments. In such environments, the load and availability
of resources change over time, so reproducibility of the
same ambient load conditions is generally not an option. In
contrast,  it  is  possible  to  achieve  reproducibility  using  sim¬ 
ulation,  but  it  may  be  difficult  to  represent  dynamic  load 
variation  in  complex  heterogeneous  systems  authentically. 

For our experiments, we performed sets of runs of
SP2Immed, SP2Immed/WS, WS, and SP2Queue back-to-back
(see footnote 2), hoping that experiments within the same set would
enjoy  roughly  similar  load  conditions.  Moreover,  we  moni¬ 
tored  the  number  of  free  nodes  in  the  SP2  and  used  this  in¬ 
formation  to  discard  execution  sets  in  which  the  nodes  avail¬ 
able  to  SP2Immed/WS  and  SP2Immed  differed  by  more 
than  two.  In  this  case,  we  considered  the  load  conditions 
for  strategies  in  the  same  set  to  be  different.  This  was  the 
case  in  37  (out  of  100)  experiment  sets.  We  therefore  ended 
up  with  63  valid  experiment  sets. 
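The discard rule can be stated compactly in code. The sketch below is illustrative only: the record layout and field names are hypothetical, and the node counts in the usage note are made up rather than taken from the experiments.

```python
def valid_sets(experiment_sets, max_diff=2):
    """Keep sets in which the two immediate strategies saw similar load.

    Each set is a dict with the node counts obtained by SP2Immed/WS and
    SP2Immed; a set is discarded when they differ by more than max_diff.
    """
    return [
        s for s in experiment_sets
        if abs(s["immed_ws_nodes"] - s["immed_nodes"]) <= max_diff
    ]
```

For instance, a set with node counts 12 and 10 is kept, while a set with 12 and 6 is discarded because the SP2 load evidently changed between the two runs.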

There  are  two  other  details  to  note  in  the  design  of  our 
experiments.  First,  when  we  use  only  the  resources  immedi¬ 
ately  available  in  the  SP2,  it  might  be  that  there  are  no  nodes 
available to execute the application. In this case, we postponed
the set of experiments until the necessary resources became
available. This happened 9 times out of 63 attempts.
Notice  that  by  excluding  this  retry  time,  we  present  opti¬ 
mistic  turnaround  times  for  the  SP2Immed  method.  A  user 
using  this  method  would  have  experienced  longer  delays. 

Second,  we  needed  to  decide  on  n  and  t  when  we  used 
the  SP2  in  the  traditional  way  (i.e.,  SP2Queue).  Note  that 
the  determination  of  the  best  n  requires  an  accurate  queue 
time  prediction.  Since  such  predictions  are  not  available, 
we  rotated  among  values  of  n  likely  to  be  used  by  GTOMO 
users: 8, 16, and 32 nodes. We then executed benchmarks
on the SP2 to determine the average processing time of
one slice, $t_b$. This enabled us to conservatively determine
t given n (a conservative estimate is needed because a job
is killed when its execution exceeds t) using the following:

$$t = 2 \times \frac{t_b \times \text{number of slices}}{n}$$
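As a worked illustration of this formula, the sketch below computes the requested time for the three node counts used in the experiments. The per-slice time of 10 seconds is an assumed value chosen for the example; the measured $t_b$ is not reported here. The function name is likewise hypothetical.

```python
def requested_time(t_b, num_slices, n, safety=2):
    """Conservative wall-clock request: safety * (t_b * num_slices) / n."""
    return safety * t_b * num_slices / n


# Assumed t_b = 10 s per slice, with the 300-slice dataset.
for n in (8, 16, 32):
    print(n, requested_time(10, 300, n))
```

Under that assumption the requests would be 750, 375, and 187.5 seconds for 8, 16, and 32 nodes respectively. The factor of 2 doubles the ideal parallel time so that the job is unlikely to be killed for exceeding its limit.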

The  results  of  the  63  experimental  sets  are  parti¬ 
tioned  into  three  groups  using  the  number  of  nodes  re¬ 
quested  for  SP2Queue:  SP2Queue(8),  SP2Queue(16),  and 
SP2Queue(32).  Figure  3  shows  the  results  of  the  experi¬ 
ment  sets  in  which  SP2Queue  used  8  nodes,  Figure  4  shows 
the  results  for  16  nodes,  and  Figure  5  shows  the  results  for 
32  nodes.  Figures  3-5(a)  depict  the  turnaround  times  of 
the  different  strategies  (WS,  SP2Immed/WS,  SP2Immed, 
and  SP2Queue).  Each  set  of  bars  in  the  figure  depicts  a 
set  of  four  executions,  one  under  each  of  the  four  strate¬ 
gies.  Since  several  of  the  SP2Queue  turnaround  times  did 
not  fit  on  the  graphs,  Table  1  displays  the  turnaround  times 
for the SP2Queue runs. The number of nodes SP2Immed
and SP2Immed/WS acquired in each set of experiments is
shown in Figures 3-5(b).

2. We used a 5-minute interval between experiments to ensure that the
Maui scheduler had time to update its availability information.

We  see  that  the  SP2Immed/WS  strategy  yielded  the  best 
performance  in  all  cases  except  one  (Figure  4,  run  8).  Fur¬ 
ther  study  indicated  contention  on  the  reader  and  writer  due 
to  the  selection  of  too  many  ptomos  (in  this  experiment  set, 
we  received  the  highest  number  of  immediately  available 
SP2  nodes).  A  future  scheduler  improvement  would  be  to 
model  the  contention  and  incorporate  it  into  the  GTOMO 
AppLeS. 

We also assess the variability of each strategy using the
coefficient of variance, $c_v$, which measures the amount of
variance relative to the mean [7]. It is defined as follows:

$$c_v = \frac{\text{standard deviation}}{\text{mean}}$$

The  SP2Immed/WS  strategy  exhibited  the  lowest  cv  in  all 
groups  of  experiments.  Table  2  shows  the  mean,  coeffi¬ 
cient  of  variance,  minimum,  and  maximum  values  for  each 
strategy  in  each  group  of  experiments.  Table  2(a)  shows 
the  results  of  the  experiment  sets  where  SP2Queue  used 
8  nodes,  Table  2(b)  shows  the  results  for  16  nodes,  and 
Table  2(c)  shows  the  results  for  32  nodes.  As  expected, 
SP2Queue’s  cv  is  quite  large  due  to  the  unpredictable  wait 
times  in  the  queue.  While  its  turnaround  time  was  some¬ 
times  close  to  SP2Immed/WS  (544s  for  SP2Queue  vs.  601s 
for  SP2Immed/WS  for  the  one  time  it  beat  SP2Immed/WS), 
its  worst  time  was  more  than  two  orders  of  magnitude 
greater  than  SP2Immed/WS  (88,323s  for  SP2Queue  vs. 
601s  for  SP2Immed/WS).  Also,  we  note  that  the  SP2Immed 
strategy  had  a  high  cv  in  the  SP2Queue(8)  results  due  to 
the  variability  of  number  of  nodes  acquired.  This  variabil¬ 
ity  was  amortized  in  the  SP2Immed/WS  strategy  because  of 
the  relatively  low  cv  of  WS. 
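The variability measure defined above is straightforward to compute. The sketch below assumes the population standard deviation (`statistics.pstdev`); the text does not say whether the population or sample form was used. The turnaround times in the usage note are hypothetical, not experimental data.

```python
import statistics


def coefficient_of_variation(samples):
    """Standard deviation divided by the mean (population form assumed)."""
    return statistics.pstdev(samples) / statistics.mean(samples)
```

For example, a steady strategy with hypothetical times `[600, 610, 590, 605]` has a much smaller value than a queue-bound strategy with times `[600, 20000, 700, 590]`, where a single long wait dominates both the mean and the spread.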

5.  Related  Work 

The  GTOMO  code  is  also  used  in  the  Computed  Micro¬ 
tomography  (CMT)  Project  at  Argonne  National  Labora¬ 
tory  (ANL)  [26,  27].  In  contrast  to  NCMIR,  projections 
are collected from an X-ray source at the Advanced Photon
Source  (APS)  located  at  ANL.  Their  work  has  focused  on 
on-line  tomography  where  data  is  collected  at  APS,  trans¬ 
ferred  to  a  128  node  SGI  Origin  2000  for  processing,  and 
then  transferred  back  to  the  user  for  visualization.  Cur¬ 
rently,  they  are  able  to  deliver  a  reconstructed  image  to  the 
user  within  minutes  after  data  acquisition  has  completed. 
The  CMT  and  NCMIR  versions  of  GTOMO  are  currently 
being  integrated  as  part  of  the  NPACI  Telescience  Alpha 
Project  [25]. 

Application scheduling for Grids is a recent and very active
area. Existing work has focused primarily on resource
discovery and scheduling [4, 17, 10, 28] and coallocation
among workstations [3, 1, 19, 22, 23, 9]. The work reported
herein extends the target domain for GTOMO by targeting
both parallel supercomputers and interactive resources simultaneously.

 run     8 nodes      16 nodes      32 nodes
   1    685.0235      591.2153     1293.9811
   2    759.2754      581.2684      532.6011
   3    698.2693    20483.5053      536.9106
   4   3077.8569      723.0844      541.6972
   5    701.3900    17268.0273      658.2000
   6    708.8977     2097.6415      565.0041
   7    691.5886      579.8207    27480.0918
   8   1805.0212      543.7824     9868.9585
   9   6163.5884      581.0620     2267.3595
  10    687.5382      615.1581      614.1135
  11    689.0494      793.9383    17735.0314
  12    682.3540     9193.4053    39286.8783
  13    700.8448      742.4905    34642.5641
  14   4625.1468     2164.4710     1120.0531
  15   1260.1495      621.3329    88322.6571
  17   3249.5812      574.9321     1809.4320
  18    710.8228      593.0303      664.4993
  19    718.0481     1203.5411     6815.6301
  20    721.3216      575.2712      607.1331

Four further values appear without recoverable run labels:
719.9383, 33715.8070, 29675.0315, 580.8794.

Table 1. Turnaround times for SP2Queue


strategy        mean        cv      min       max
SP2Immed        1946.68     2.10    --        19694.46
SP2Immed/WS     --          0.17    --        554.50
WS              --          0.19    --        1133.44
SP2Queue        --          --      --        6163.59

(a) SP2Queue(8) results


strategy        mean        cv      min       max
SP2Immed        --          --      --        1105.61
SP2Immed/WS     --          --      --        600.98
WS              777.53      0.19    588.76    1127.03
SP2Queue        4515.41     1.94    543.78    33715.81

(b) SP2Queue(16) results


strategy        mean        cv      min       max
SP2Immed        --          --      --        1262.72
SP2Immed/WS     --          --      --        519.13
WS              789.69      0.20    587.68    1128.41
SP2Queue        13251.89    1.66    532.60    88322.66

(c) SP2Queue(32) results

A "--" marks a cell that is illegible in the source. In rows with
illegible cells, the last surviving value is placed under max by
analogy with the complete rows; this placement is uncertain.

Table 2. Summary results of experiments

6.  Discussion  and  Conclusions 

In  this  work,  we  show  how  to  combine  workstations  and 
supercomputers  to  run  GTOMO,  a  work  queue  application 
used  in  production  at  NCMIR.  Our  solution  automatically 
selects all resources immediately available across the system.
We leverage the Maui Scheduler to obtain information
on immediately available SP2 nodes. This strategy has
the advantage of not requiring predictions of how long requests
wait in the supercomputer queue. Our experimental
results show that the GTOMO AppLeS scheduling strategy
consistently outperforms three other strategies that can be
used for scheduling in a typical laboratory setting where
researchers  have  access  to  a  local  cluster  of  workstations 
and  supercomputer  time. 

We  have  learned  three  interesting  lessons  about  Compu¬ 
tational  Grids  in  general  as  a  result  of  this  effort.  First,  the 
interface  exported  by  the  resource  scheduler  has  great  im¬ 
pact  on  application  schedulers.  In  fact,  we  can  implement 
our  strategy  in  a  very  straightforward  manner  thanks  to  the 
Maui  Scheduler’s  showbf  command.  On  the  other  hand, 
the  Maui  Scheduler  (as  with  other  supercomputer  sched¬ 
ulers,  for  that  matter)  precluded  us  from  trying  something 
more  sophisticated  due  to  the  difficulty  in  predicting  queue 
times  for  supercomputer  requests.  Emerging  efforts  such 
as  S3  [6],  GARA  [11],  and  more  generally,  the  Grid  Forum 
Scheduling Working Group [13] are working to change this.

Second,  evaluating  solutions  for  real  applications  run¬ 
ning  over  production  environments  has  proven  to  be  diffi¬ 
cult  due  to  the  impossibility  of  reproducing  the  system  load 
and  queue  conditions  for  comparison  runs.  Others  have  en¬ 
countered  the  same  problem.  Indeed,  a  simulation  environ¬ 
ment  specifically  targeted  toward  Grids  such  as  the  Bricks 
project  [24],  the  MicroGrid  [16],  or  the  work  described 
in  [5]  would  be  very  useful. 

Third,  fault  tolerance  is  likely  to  be  even  more  important 
in  Grid  computing  than  it  is  in  parallel  computing.  For  our 
solution  in  particular,  fault  recovery  was  a  natural  way  to 
deal  with  the  time  expiration  of  SP2  requests.  In  general, 
using  autonomous  and  distributed  resources  increases  the 
chance  that  some  component  of  the  application  will  fail. 

The  GTOMO  AppLeS  scheduler  has  been  incorporated 
with  the  production  version  of  GTOMO  at  NCMIR  and  is 
used daily by researchers. Current work involves extending
the applicability of the scheduler to additional resources and
different scenarios of the application.

Abstract

This paper examines the issues surrounding efficient execution
in heterogeneous grid environments. The performance
of a Linux cluster and a parallel supercomputer is
initially compared using both benchmarks and an application.
With an understanding of how benchmark and application
performance is affected by processor and interconnect
speed, a comparison is made with the bandwidth and
latencies available in a grid testbed. Of significant concern
is the fact that the available communication bandwidth and
latencies have a dynamic range of 3 to 4 orders of magnitude
while processor speeds have a range of about one
half order of magnitude. Also, while both processor speed
and network bandwidth are increasing very rapidly, simple
propagation delay will become more significant in the network
latencies seen by many grid applications. That is to
say, the pipes in a grid will be getting fatter but not commensurately
shorter. How are we to effectively utilize such an
infrastructure? Clearly an attractive approach is to require
sufficient concurrency in the application such that a coarse-grain,
data-driven model of execution can be used to hide
latencies while hopefully keeping context switching overheads
low. If the “spatial component” of an application
is understood, then runtime systems could also apply established
techniques like caching, compression, estimation and
speculative pre-fetching. Ideally this low-level performance
management should be encapsulated in an easy-to-use abstraction.


1  Introduction 

Cluster  computing  has  been  gaining  wide  acceptance 
over  single-machine,  massively  parallel  computing  due  to 


its  undeniable  cost-effectiveness  for  suitable  applications 
[4].  Since  clusters  are  built  from  commodity  hardware, 
however,  they  typically  have  slightly  slower  processors 
and  lower  communication  bandwidths  than  “big  iron”  ma¬ 
chines.  Hence,  suitability  in  this  context  means  simply  that 
either  (1)  an  application  must  be  more  tolerant  of  higher 
communication  costs,  or  (2)  the  user’s  “mission  require¬ 
ments”  are  lenient  enough  to  accept  the  lower  performance 
at  a  much  lower  dollar  cost. 

The  increasing  potential  of  grid  computing ,  however, 
means that users and applications will be faced with environments
that have an even greater heterogeneity of communication
abilities [8]. While this potential includes the
flexible  harnessing  of  resources  on  a  scale  not  previously 
considered  for  individual  applications,  it  also  means  that 
achieving  efficient  use  of  those  resources  will  be  harder  than 
ever.  This  paper  endeavors  not  to  present  any  solutions  to 
this  problem  but  to  quantitatively  demonstrate  the  bounds  of 
the  problem  as  motivation  for  exploring  candidate  program¬ 
ming  and  execution  models  that  can  effectively  operate  in  a 
grid  environment. 

We  will  do  this  by  comparing  the  performance  of  a  paral¬ 
lel  machine,  a  cluster,  and  a  grid  testbed  by  several  means. 
These  are  specifically  the  Cray  T3E  [13],  a  Pentium  clus¬ 
ter  with  fast  ethernet,  and  the  Globus  GUSTO  testbed  [7], 
which  are  representative  of  their  respective  classes.  (Dif¬ 
ferent  examples  of  each  class  could  be  used  but  the  funda¬ 
mental  relationships  between  them  would  not  be  altered.) 
The  Cray  T3E  used  here  is  the  T3E-1200  at  the  CEWES 
Major  Shared  Resource  Center  in  Vicksburg,  Mississippi. 
It has 512 DEC Alpha 21164 processors clocked at 600
MHz, with 128 MB of memory per processor. Its 3D torus
dedicated interconnect is capable of 650 MB/sec (theoretical
peak) in both directions. The cluster used here is at
the Aerospace Corporation. It has fourteen Intel Pentium
II processors clocked at 400 MHz, running Red Hat Linux
with 192 MB of memory per processor. They are connected
by 100 Mbit/sec fast ethernet through a Baystack 450-24T
switched ethernet hub. The Globus GUSTO testbed is distributed
across many administrative sites and includes a large
variety  of  machines.  The  current  configuration  of  GUSTO 
can  always  be  examined  by  using  the  Metacomputing  Di¬ 
rectory  Service  (MDS)  Browser  on  the  Globus  web  site 
(www.globus.org). 

These  “machines”  (including  the  grid)  will  be  compared 
using  parallel  benchmarks,  a  parallel  application,  and  a  dis¬ 
tributed  performance  monitoring  tool.  We  will  look  at  the 
relative  processing  speeds  and  communication  speeds.  We 
then  discuss  the  implications  of  achieving  efficiency  in  an 
increasingly  heterogeneous  computing  infrastructure. 

2  A  Benchmark  Comparison 

To  compare  the  performance  of  a  Linux  cluster  and  a  par¬ 
allel  supercomputer,  we  use  the  NAS  Parallel  Benchmarks 
[12].  These  benchmarks  were  developed  by  the  Numeri¬ 
cal  Aerodynamics  Simulation  (NAS)  group  at  NASA  Ames 
with  the  goal  of  being  able  to  make  more  reliable  quanti¬ 
tative  performance  comparisons  among  parallel  machines. 
These  benchmarks  consist  of  numerical  kernels  for  a  wide 
range  of  computing  problems.  Rather  than  a  single  “micro¬ 
benchmark”  that  may  exercise  only  one  aspect  of  a  machine, 
these  benchmarks  were  chosen  to  exercise  all  aspects  of 
a  machine,  individually  and  in  combination.  Specifically 
these  benchmarks  exercise  communication,  integer  compu¬ 
tation  and  floating-point  computation. 

For brevity, we present only the
results of two benchmarks that illustrate the major difference
between these two platforms. For each benchmark,
the  per  processor  performance  is  plotted  as  a  function  of 
the  number  of  nodes.  These  two  benchmarks  involve  inte¬ 
ger  computation,  so  the  measurement  metric  is  millions  of 
operations  per  second  per  node:  Mop/sec/node.  This  allows 
the  relative  performance  and  scalability  on  each  platform  to 
be  shown  in  one  graph.  For  each  benchmark,  there  are  also 
three classes, A, B, and C, that correspond to three different
problem  sizes,  with  A  being  the  smallest  and  C  being  the 
largest.  Hence,  for  each  benchmark  graph,  there  are  three 
curves  (one  for  each  class)  for  both  platforms.  For  con¬ 
sistency  and  ease  of  comparison,  the  same  point  symbol  is 
used  for  each  platform.  The  same  line  style  is  used  for  each 
benchmark  class. 

Figure  1  shows  the  Random  Number  Generation  bench¬ 
mark.  This  is  an  “embarrassingly  parallel”  benchmark  since 
the  parallel  tasks  (generating  random  numbers)  are  com¬ 
pletely  independent,  i.e.,  after  the  tasks  are  started,  there  is 
absolutely  no  communication  or  synchronization  between 
nodes. As expected, both platforms show good scaling (flat
curves). The Alpha processors, however, are approximately
4x faster than the Pentium IIs.

Figure  2  shows  the  Integer  Sorting  benchmark.  Integer 
sorting  is  not  a  computationally  complex  task  since  it  pri¬ 
marily  requires  the  comparison  of  integers.  It  can,  however, 
require  massive  amounts  of  communication  as  data  values 
are  relocated  to  their  sorted  positions.  Here  we  see  that  the 
T3E  exhibits  not  only  faster  processing  but  also  much  bet¬ 
ter  communication  scaling.  For  the  cluster,  the  per-node 
performance  falls  off  dramatically  as  the  number  of  nodes 
increases. 

These  two  benchmarks  dramatically  illustrate  the  per¬ 
formance  differences  between  parallel  and  distributed  com¬ 
putations  that  are  compute-bound  versus  communication- 
bound.  In  the  sorting  benchmark,  the  communication  band¬ 
width  is  clearly  dominating  the  overall  performance.  In 
terms  of  relative  performance  and  scalability,  the  other  NAS 
Parallel Benchmarks fall in between these two extremes.

3  An  Application  Comparison 

In  this  section,  we  use  an  application  to  compare  the 
performance  of  these  two  platforms.  That  application 
is  ALSINS  (Aerospace  Launch  Systems  Implicit  Navier- 
Stokes),  a  computational  fluid  dynamics  (CFD)  code  devel¬ 
oped  by  the  Fluid  Mechanics  Department  at  Aerospace  and 
used  to  investigate  flow  fields  of  the  Delta-II  and  Titan-IV 
launch  vehicles  [16,  15]. 

CFD  works  by  discretizing  the  space  around  a  physical 
object  into  “cells”  and  computing  the  flux  of  material  be¬ 
tween  cells  by  solving  the  Navier-Stokes  equations  for  a 
sequence  of  time  steps  until  the  solution  has  converged  to 
a  final  state.  CFD  is  typically  parallelized  by  decompos¬ 
ing  the  discretized  spatial  domain  and  assigning  different 
blocks  to  different  processors.  The  algorithm  has  an  iter¬ 
ative  structure  consisting  of  (1)  exchanging  neighbor  data, 
(2)  computing  the  minimum  time  step  among  all  blocks, 
and  (3)  computing  the  flux  for  the  current  time  step.  For 
ALSINS,  this  is  implemented  using  MPI. 
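
The three-step iteration described above can be sketched in a single process. This is a minimal illustration, not ALSINS code: a one-dimensional explicit diffusion update stands in for the Navier-Stokes flux computation, the time-step rule is invented, and all function names are ours. In an MPI code, step (2) would typically be a minimum all-reduce; the source does not say which call ALSINS uses.

```python
def exchange_halos(blocks):
    """Step 1: give each block its neighbors' edge cells.

    At the two ends of the domain the block's own edge cell is reused,
    which gives a zero-flux boundary.
    """
    halos = []
    for i, block in enumerate(blocks):
        left = blocks[i - 1][-1] if i > 0 else block[0]
        right = blocks[i + 1][0] if i + 1 < len(blocks) else block[-1]
        halos.append((left, right))
    return halos


def local_time_step(block, dx=1.0):
    """Stable step for one block (hypothetical rule: shrinks as values grow)."""
    return 0.4 * dx * dx / (1.0 + max(abs(u) for u in block))


def update(block, halo, dt, dx=1.0):
    """Step 3: explicit diffusion update, standing in for the flux computation."""
    padded = [halo[0]] + block + [halo[1]]
    r = dt / (dx * dx)
    return [padded[i] + r * (padded[i - 1] - 2.0 * padded[i] + padded[i + 1])
            for i in range(1, len(padded) - 1)]


def iterate(blocks, n_steps):
    for _ in range(n_steps):
        halos = exchange_halos(blocks)                 # (1) neighbor exchange
        dt = min(local_time_step(b) for b in blocks)   # (2) global minimum time step
        blocks = [update(b, h, dt)                     # (3) local update per block
                  for b, h in zip(blocks, halos)]
    return blocks


# The same six cells, decomposed into two blocks and kept as one block.
split = iterate([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]], n_steps=5)
whole = iterate([[0.0, 0.0, 0.0, 10.0, 0.0, 0.0]], n_steps=5)
flat = [u for block in split for u in block]
print(flat)
print(whole[0])
```

The decomposed run and the single-block run produce the same cell values, which is the property that the neighbor exchange and the shared time step are there to guarantee.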

With  this  basic  structure,  there  are  two  hard  synchroniza¬ 
tions  per  iteration:  exchanging  neighbor  data  and  the  mini¬ 
mum  time  reduction.  Aside  from  potential  synchronization 
delays,  the  minimum  time  reduction  is  a  very  quick  opera¬ 
tion  since  it  only  involves  finding  the  minimum  of  a  single 
floating-point  time  step  value  across  all  nodes.  The  time  re¬ 
quired  for  communication  and  the  local  flux  computation, 
however,  depends  on  the  data  block  size  allocated  to  each 
node.  Note  that  it  is  possible  to  improve  efficiency  by  over¬ 
lapping  communication  and  computation  for  a  given  itera¬ 
tion.  The  rate  of  convergence  for  the  solution  depends  on 
the geometry of the test case and can be on the order of
10^5 iterations. The speed at which iterations can be computed
depends on the total size of the discretized space and
the number and speed of the processing nodes used on the
problem.

Figure 1. The Random Number Generation Benchmark.
(Plot legend: t3e.ep.A.mops, t3e.ep.B.mops, t3e.ep.C.mops, beo.ep.A.mops, beo.ep.B.mops, beo.ep.C.mops; x-axis: number of nodes.)

Figure 2. The Integer Sorting Benchmark.

The  test  case  computed  using  ALSINS  is  the  flow  field 
around  the  base  of  a  Centaur  launch  vehicle  with  both  en¬ 
gines  running  with  exhaust  plumes.  Figure  3  shows  a  flow 
field  computation  done  on  the  Pentium  cluster. 

ALSINS  performance  was  measured  and  analyzed  us¬ 
ing  NetLogger  [14],  a  tool  developed  at  Lawrence  Berke¬ 
ley  Lab  for  analyzing  distributed  systems.  NetLogger  logs 
timestamped,  application-defined  events  either  locally  or  to 
a  remote  logging  daemon.  The  NetLogger  visualization 
tool,  nlv,  can  subsequently  display  these  events  as  grouped, 
color-coded  sets  of  events,  called  lifelines,  over  time.  Other 
events or statistics associated with scalar values, such as
CPU load, can also be displayed as loadlines.

ALSINS  with  the  Centaur  Double  Nozzle  test  case  was 
run  on  both  platforms  in  two  versions  using  overlapped  and 
non-overlapped  communication.  The  NetLogger  visualiza¬ 
tion  display  for  two  representative  iterations  of  the  over¬ 
lapped  code  version  on  the  Pentium  cluster  is  shown  in 
Figure 4. This shows the color-coded, per-iteration lifelines
for each node. Each lifeline consists of six event
tags: REDUCE_TAU, START_COMM, START_SOLVER,
SOLVER_DONE, COMM_DONE, and COPIES_DONE.
Loadlines  for  the  utilization  (number  of  processors  actively 
engaged  in  communication  or  computation)  and  the  effi¬ 
ciency  (utilization  over  the  duration  of  the  computation)  are 
also  shown. 

These results show us that, per iteration, ALSINS is
≈2.9x faster on the T3E than on the cluster (1.63 seconds
on the T3E vs. 4.75 seconds on the cluster). Part of this
difference is due to the faster processors on the T3E and
also a memory subsystem augmented with stream buffers.
Of this iteration time, however,
there is ≈19% idle time due to the load imbalance. Hence,
there is an efficiency of ≈81% where processors are busy
doing communication or computation. On the cluster, communication
takes ≈16% of the iteration time. On the T3E,
communication takes ≈2%.

4  A  Bandwidth  µbenchmark  Comparison

It is certainly not news that communication bandwidth
plays a direct role in determining parallel application performance.
In the scope of emerging computational infrastructures,
however, what is the depth of the communication
hierarchy? What is the range of impact that communication
infrastructures can have, and will have, on distributed,
parallel applications? We examine this question in two
parts.  First,  we  do  a  simple  MPI  bandwidth  test  between  the 
Pentium  cluster  and  the  T3E.  Second,  we  compare  these  re¬ 
sults  with  a  histogram  of  host  pair  bandwidths  on  the  Globus 
GUSTO  testbed  [7]. 

The MPI bandwidth test program we used tests a variety
of communication patterns with differing numbers of nodes
and different data volumes (message sizes). We ran this program
on both the Pentium cluster and the T3E. For brevity,
we present only the most relevant data in
Table 1. In this particular test, bidirectional communication
occurs among all nodes simultaneously for two to eight
nodes. This means that every node is sending a 1 MB message
to, and receiving a 1 MB message from, all other nodes at the
same time to stress the limits of performance.

The cluster is theoretically capable of 100 Mbit/sec. or
12.5 MB/sec. For two nodes, a bidirectional bandwidth of
over 10 MB/sec., or 80 Mbit/sec., is achieved. This is the
expected end-to-end result since overhead in the message-passing
process, e.g., buffer copying and device driver
scheduling, means that an application will always see
less bandwidth than the physical medium is “clocked” at; in
this case, fast ethernet. Note, however, that as more nodes
are added to the test, the realized bandwidth sinks to about
4 MB/sec. This indicates that contention for resources is
occurring somewhere. (While the hub is technically non-blocking,
it may still have a backplane that is becoming saturated
as the aggregate bandwidth demand increases.) For
the T3E, we see that two nodes are capable of over 300
MB/sec. For eight nodes, the average bidirectional bandwidth
is still over 200 MB/sec. This means that the dedicated
communication hardware on the T3E is 30x to 50x
faster than fast ethernet in a cluster. (This might lead one to
conclude that much of a large machine’s cost is in its dedicated
interconnect.)
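
The unit conversion and the T3E-to-cluster ratio can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are ours, and the four bandwidth values are the two-node and eight-node averages from Table 1.

```python
BITS_PER_BYTE = 8


def mbit_to_mbyte(mbit_per_sec):
    """Convert a rate in Mbit/sec. to MB/sec."""
    return mbit_per_sec / BITS_PER_BYTE


# Average bidirectional bandwidth from Table 1 (MB/sec.),
# keyed by the number of nodes taking part in the test.
CLUSTER_AVG = {2: 10.348, 8: 4.489}
T3E_AVG = {2: 319.252, 8: 211.705}

peak = mbit_to_mbyte(100)                  # fast ethernet, theoretical
fraction_of_peak = CLUSTER_AVG[2] / peak   # what two cluster nodes achieve
ratios = {n: T3E_AVG[n] / CLUSTER_AVG[n] for n in (2, 8)}

print(f"fast ethernet peak: {peak} MB/sec.")
print(f"two-node cluster fraction of peak: {fraction_of_peak:.2f}")
for n, ratio in ratios.items():
    print(f"{n} nodes: T3E is {ratio:.1f}x the cluster")
```

With these inputs the ratio is roughly 31 for two nodes and roughly 47 for eight nodes, in line with the range quoted above.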

How do these bandwidths compare with those typically
available in a grid environment? To answer this question,
we  made  use  of  the  Gloperf  network  performance  data  that 
is  periodically  uploaded  into  the  Globus  Metacomputing 
Directory  Service  (MDS)  [6].  The  MDS  is  based  on  the 
Lightweight  Directory  Access  Protocol  (LDAP)  and  pro¬ 
vides  an  information  naming  scheme  and  repository  for  all 
manner  of  grid  computing  information,  e.g.,  available  hosts, 
number  of  nodes,  current  load,  network  interfaces,  gate¬ 
keeper  contact  information,  etc.  It  is  also  used  to  record 
bandwidth  and  latency  data  periodically  measured  between 
host  pairs  by  Gloperf. 

Gloperf  [10]  is  a  simple  tool  that  is  automatically  de¬ 
ployed  on  each  Globus  host.  At  Globus  boot-time,  the 
Gloperf  daemon  will  register  itself  in  the  MDS  and  then 
query  the  MDS  for  all  other  Gloperf  daemons.  The  daemon 
will  then  make  periodic  bandwidth  and  latency  tests  with  all 
other  daemons  and  store  the  results  in  the  MDS.  (The  initial 
implementation  of  Gloperf  did  measurements  between  all 
pairs  which  does,  of  course,  result  in  non-scalable  behavior. 
The  latest  implementation  uses  a  simple  group  scheme  to 
produce  hierarchies  of  measurements.) 
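
To see why all-pairs measurement does not scale, it helps to count host pairs. The sketch below compares a flat all-pairs scheme with a two-level grouping. The grouping rule (full measurement inside each group, plus full measurement among one representative per group) and the group size of 12 are assumptions made purely for illustration; the source does not describe Gloperf's group scheme in this detail. The host count of 138 is the GUSTO figure reported in this paper.

```python
def all_pairs(n_hosts):
    """Number of unordered host pairs among n_hosts."""
    return n_hosts * (n_hosts - 1) // 2


def grouped_pairs(n_hosts, group_size):
    """Pairs measured under a hypothetical two-level grouping."""
    full_groups, remainder = divmod(n_hosts, group_size)
    sizes = [group_size] * full_groups + ([remainder] if remainder else [])
    within = sum(all_pairs(size) for size in sizes)   # inside each group
    between = all_pairs(len(sizes))                   # one representative per group
    return within + between


flat = all_pairs(138)
grouped = grouped_pairs(138, group_size=12)
print(f"all pairs: {flat}, grouped: {grouped}")
```

The flat count grows quadratically with the number of hosts, while the grouped count grows roughly linearly once the group size is fixed.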

The actual Gloperf measurement mechanism is borrowed
from netperf. Gloperf is configured to perform a
10-second, TCP “packet-blasting” test and measure the data
volume sent. Gloperf also measures the number of round trips
that can be made in a 10-second period. The order of
hosts and test types (bandwidth and latency) is randomized.
Since Gloperf does untuned TCP testing from
the user-level, it essentially observes the same end-to-end
performance that an application would see.

Figure 3. The Centaur Double Nozzle Test Case Computed on a Pentium Cluster.

Figure 4. ALSINS with overlapped communication/computation on the Pentium Cluster.

Cluster Bandwidth, MB/sec.

          2 nodes  3 nodes  4 nodes  5 nodes  6 nodes  7 nodes  8 nodes
Avg.       10.348    8.076    8.730    7.869    6.472    6.778    4.489
Node 0     10.379   10.225   10.028    9.979    6.689    8.355    6.072
Node 1     10.317    6.957    7.561    7.941    6.216    7.276    4.907
Node 2          -    7.047    9.910    6.675    7.035    6.483    4.948
Node 3          -        -    7.421    8.135    6.207    5.868    4.426
Node 4          -        -        -    6.616    6.354    7.194    4.167
Node 5          -        -        -        -    6.333    6.445    3.942
Node 6          -        -        -        -        -    5.822    3.727
Node 7          -        -        -        -        -        -    3.726

T3E Bandwidth, MB/sec.

          2 nodes  3 nodes  4 nodes  5 nodes  6 nodes  7 nodes  8 nodes
Avg.      319.252  253.973  224.496  194.584  203.172  222.200  211.705
Node 0    319.101  253.161  218.161  194.335  215.007  231.971  212.216
Node 1    319.402  253.259  211.452  190.279  181.744  193.002  191.008
Node 2          -  255.498  241.411  219.954  242.082  282.610  236.182
Node 3          -        -  226.958  183.446  184.683  198.681  200.079
Node 4          -        -        -  184.905  213.967  238.679  235.900
Node 5          -        -        -        -  181.548  217.750  222.200
Node 6          -        -        -        -        -  192.702  205.023
Node 7          -        -        -        -        -        -  191.024

Table 1. Bandwidth tests for T3E and Cluster.

To  extract  the  Gloperf  data  from  the  MDS,  a  simple  pro¬ 
gram  would  periodically  snapshot  the  MDS  Gloperf  data 
into  log  files.  Scripts  were  then  used  to  extract  just  the  band¬ 
width  and  latency  data  and  eliminate  duplicates.  Figure  5 
shows  histograms  of  Gloperf  bandwidth  and  latency  mea¬ 
surements  on  GUSTO  beginning  in  August  and  continuing 
through  October,  1999.  This  represents  18615  unique  mea¬ 
surements  between  3405  unique  host  pairs  over  138  unique 
hosts;  mostly  in  North  America  but  including  a  few  in  Eu¬ 
rope,  Asia,  and  Australia.  Note  that  these  histograms  em¬ 
ploy  log-sized  bins  that  make  the  mode  of  the  distributions 
much  more  evident.  While  it  was  not  uncommon  to  ob¬ 
serve  bandwidths  as  high  as  96  Mbits/sec.,  the  median  band¬ 
width  for  this  distribution  is  2.2  Mbits/sec.  and  the  90th  per¬ 
centile  is  15.1  Mbits/sec.  While  the  latency  distribution  has 
a  much  narrower  (rhinokurtotic)  mode,  common  latencies 
span  three  orders  of  magnitude.  Here,  the  median  is  57.35 
msec,  and  90%  of  the  latencies  are  above  5.5  msec. 
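
Log-sized bins are straightforward to build: the bin edges are evenly spaced in the logarithm of the measured value rather than in the value itself. The sketch below shows one way to do this, together with the median and 90th percentile. The sample values are invented for illustration and are not Gloperf data, and the function names are ours.

```python
import math
import statistics


def log_bin_edges(lo, hi, nbins):
    """Return nbins + 1 edges spaced evenly in log10 between lo and hi."""
    log_lo, log_hi = math.log10(lo), math.log10(hi)
    width = (log_hi - log_lo) / nbins
    return [10 ** (log_lo + i * width) for i in range(nbins + 1)]


def log_histogram(samples, lo, hi, nbins):
    """Count samples per log-sized bin; values outside [lo, hi] are ignored."""
    log_lo = math.log10(lo)
    width = (math.log10(hi) - log_lo) / nbins
    counts = [0] * nbins
    for x in samples:
        if lo <= x <= hi:
            index = int((math.log10(x) - log_lo) / width)
            counts[min(index, nbins - 1)] += 1   # x == hi goes in the last bin
    return counts


# Hypothetical bandwidth samples in Mbit/sec.
samples = [0.1, 0.5, 1.2, 2.0, 2.4, 3.0, 8.0, 15.0, 40.0, 96.0]
edges = log_bin_edges(0.1, 100.0, nbins=3)
counts = log_histogram(samples, lo=0.1, hi=100.0, nbins=3)
median = statistics.median(samples)
p90 = statistics.quantiles(samples, n=10)[-1]   # last decile cut point

print(counts, median, p90)
```

Each bin here covers one decade, so a distribution that spans several orders of magnitude still shows its mode clearly instead of being squeezed into the first linear bin.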

It  is  clear  that  in  a  grid  environment  that  can  include  clus¬ 
ters  and  “big  iron”  machines,  there  can  be  a  3  to  4  order  of 
magnitude  dynamic  range  in  the  bandwidth  and  latencies 
available  to  an  application. 


5  Discussion  and  Implications 

What  are  the  implications  of  these  observations  for  het¬ 
erogeneous  cluster  performance  and  grid  performance?  The 
argument  can  be  made  that  there  is  a  much  greater  dynamic 
range in the available communication bandwidths than
there is in processor speeds: 3 to 4 orders of magnitude versus
one half order of magnitude. We note that since the
graphs  in  Figure  5  represent  capacity  that  is  shared  among 
other  non-Globus  traffic,  one  could  argue  that  in  terms  of 
a  shared  resource,  processors  could  exhibit  the  same  range 
of available cycles. A counter-argument, however, is that
one typically has greater control over the compute resources
than over the network; even if one does not have complete
control of the processors, the processors may be batch
scheduled. Regardless of such arguments, in many cases
there  will  be  a  significant  differential  between  the  available 
processor  speeds  and  network  speeds. 

Figure 5. Globus testbed bandwidth and latency distributions.
(Latency panel x-axis: latency (ms), 100 log-sized bins.)

What is the implication of this processor-network differential?
We note that processor speeds have been increasing
according to Moore’s Law (doubling every 18
months). Memory bandwidth, however, has been increasing
much more slowly, by some estimates as little as 7%
per year. To cope with this processor-memory differential,
hardware designers have had to use increasingly larger caches
and to employ numerous techniques to overlap operations
to hide latency, such as speculative execution, prefetching,
and hardware multithreading. This has also motivated the
research in Processing-In-Memory (PIM) architectures [9]
where much higher bandwidths between memory and the
processing units can be realized.
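
The size of the processor-memory differential follows from simple compounding. As a worked example, taking the two quoted rates at face value (a doubling every 18 months for processors and 7% per year for memory bandwidth), the sketch below computes the growth of each over a decade. The function name and the ten-year horizon are our choices.

```python
def growth_factor(rate_per_period, period_years, years):
    """Compound growth: multiply by rate_per_period once every period_years."""
    return rate_per_period ** (years / period_years)


processor = growth_factor(2.0, 1.5, 10)   # doubling every 18 months
memory = growth_factor(1.07, 1.0, 10)     # 7% per year
gap = processor / memory

print(f"processor: {processor:.1f}x, memory: {memory:.2f}x, gap: {gap:.1f}x")
```

Under these assumptions processor speed grows roughly a hundredfold in ten years while memory bandwidth roughly doubles, which is the gap that caches and latency-hiding techniques are meant to cover.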

Fortunately, network bandwidths seem to have been increasing
at least as fast as Moore’s Law, if not faster, since around
1994 (A.W. - After Web). Unfortunately, this improvement
in bandwidth will not affect the speed of light. A significant
part of the latencies present in a grid is simply propagation
delay. As an example, the end-to-end, application-level
message-passing latency between Los Angeles and Chicago
can already be as much as 33% propagation delay. Clearly
this limits the reduction in latency that is physically possible
in a grid computing environment. Hence, in ten years’
time, we might expect the bandwidth distribution in Figure 5
to move to the right by an order of magnitude. While
the possible relative reduction in latency will depend on the
geographic separation among the compute resources, it is
safe to say that many latencies in the latency distribution
will not decrease as much. The bottom line is that pipes
will get fatter but not commensurately shorter.
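
The propagation floor can be estimated from distance alone. The sketch below assumes a signal speed of about 200,000 km/sec. (roughly two thirds of the speed of light in vacuum, a common rule of thumb for optical fiber) and a path length of 2,800 km, which is approximately the great-circle distance between Los Angeles and Chicago; real network routes are longer. Both numbers are our assumptions, not measurements from this paper.

```python
def propagation_delay_ms(distance_km, signal_speed_km_per_s=200_000.0):
    """One-way propagation delay in milliseconds."""
    return distance_km / signal_speed_km_per_s * 1000.0


def implied_total_latency_ms(propagation_ms, propagation_fraction):
    """Total latency if propagation makes up the given fraction of it."""
    return propagation_ms / propagation_fraction


one_way = propagation_delay_ms(2800.0)
total = implied_total_latency_ms(one_way, 0.33)

print(f"propagation: {one_way:.1f} ms, implied end-to-end: {total:.1f} ms")
```

No increase in bandwidth reduces the first number; only the remaining two thirds of the latency in this example is open to improvement.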

How  exactly  will  this  relative  change  in  bandwidths, 
latencies  and  processing  speeds  affect  application  perfor¬ 
mance?  Work  done  by  Martin,  et  al.,  is  relevant  to  this  ques¬ 
tion  [11].  They  examined  the  effect  of  latency,  overhead  and 
bandwidth  on  cluster  performance.  For  this  work,  latency  is 
defined  as  the  end-to-end  delay  in  sending  a  message  from 
its  source  to  its  destination.  Overhead  is  defined  as  the  time 
that  a  processor  is  engaged  in  message  transmission  or  re¬ 
ceipt  during  which  it  cannot  do  anything  else.  Bandwidth  is 
inversely  defined  in  terms  of  the  “gap”  between  consecutive 
message  sends,  i.e.,  messages  per  unit  time. 

A set of applications was run on an UltraSPARC
cluster using Myrinet, where the LANai processor
on the Myricom network interface card was used to emulate
a range of latencies, bandwidths and overheads. The applications
in this experimental context were much more sensitive
to overhead than to bandwidth or latency. For these
applications, one is forced to conclude that since the additional
per-message overhead was unavoidable, it directly
affected the application’s running time, while at least part of
the additional latency was naturally overlapped or hidden by
the structure of the application. It was also shown that the
applications had relatively modest bandwidth requirements
compared to the dedicated network’s capacity. Most applications
did not slow down significantly until the bandwidth
was effectively reduced to approximately 12% of its normal
capacity. Indeed, even the increased latencies in this cluster
were well below those found in a grid and the reduced
bandwidths were above the 90th percentile of the grid
bandwidths in Figure 5. In a general
grid environment, these results would be different.
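
A rough cost model makes the overhead result plausible. The sketch below is a simplified model in the spirit of LogP, built from the latency, overhead, and gap defined above. It is our illustration, not the model or code used by Martin et al., and the time values are invented. It assumes one sender issuing back-to-back messages to one receiver, with no other work to overlap.

```python
def stream_time(n_msgs, latency, overhead, gap):
    """Time until the last of n_msgs back-to-back messages has been received.

    The sender can start a new message every max(overhead, gap) time units.
    The last message then pays send overhead, latency, and receive overhead.
    """
    interval = max(overhead, gap)
    return (n_msgs - 1) * interval + overhead + latency + overhead


# Hypothetical values in microseconds.
base = stream_time(1000, latency=10, overhead=5, gap=2)
more_latency = stream_time(1000, latency=110, overhead=5, gap=2)
more_overhead = stream_time(1000, latency=10, overhead=15, gap=2)

print(base, more_latency, more_overhead)
```

In this model, 100 extra units of latency are paid once because the messages are pipelined, whereas 10 extra units of overhead are paid on every message, so the overhead increase dominates the total.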

In the light of these considerations, the next question to
ask is “How tightly coupled do distributed, heterogeneous
grid applications need to be or can be?” Clearly not all applications
are tightly coupled or need to be, in the sense
that a CFD code is tightly coupled. Applications that connect
unique resources, such as X-ray sources, with visualization
devices, such as CAVEs, typically rely on a functional
decomposition that is more tolerant of the dynamic
range of bandwidth within a machine and between machines.
Nonetheless, all distributed applications will run
better with faster networks. This is not news. In the context
of the World Wide Web, most people probably feel that
downloads are too slow. In part, the notion of quality of service
is to provide a “floor” to the performance that a user
receives from a shared resource, e.g., a network.

The opinion is also held that flexibility is actually more
important for grid applications than performance management.
For a large class of applications, this will be true.
The grid is being designed to make it as easy as possible to
compose disparate resources such as specialized databases,
unique instruments, and embedded systems. For another
large class of applications, however, the grid holds the
promise of applying very large amounts of aggregate compute
power to very large problems in a way that is not economically
feasible by any other means. Hence, what can be done to manage
performance across these bandwidths and latencies?

Cluster computing can, again, be used as a point of departure.
Several projects have been reported that deal with
programming clusters of SMPs, or clumps, where the heterogeneity
of in-memory communication vs. network communication
is the central issue. The SIMPLE model [2], for
example, provides a simple set of collective operations that
are handled by different modules for intra-node and inter-node
communication. KeLP and its Data Mover [5, 1] take
a different approach. KeLP defines a set of meta-data abstractions,
such as Region, Map, FloorPlan and MotionPlan,
that capture the geometry of block-structured
decomposition and the resulting data dependencies in parallel
execution. The current Data Mover implementation
uses a private MPI communicator and asynchronous point-to-point
messages to actually move the data.

An  important  issue  for  communication  libraries  or  run¬ 
time  systems  that  support  higher-level  semantics,  however, 
is  that  of  irregular  communication;  communication  that 
does  not  follow  a  regular,  geometric  pattern  and  may  be  dy¬ 
namic  and  not  known  until  run-time.  This  issue  has  been 
faced  by  the  High  Performance  Fortran  (HPF)  community 
for  some  time.  This  has  given  rise  to  an  inspector-executor 
paradigm  where  an  inspector  routine  does  a  run-time  analy¬ 
sis  of  array  accesses  for  communication  and  derives  a  com¬ 
munication  schedule  that  is  then  used  by  the  executor  rou¬ 
tine  to  actually  perform  the  communication.  Since  the  in¬ 
spector  routine  can  be  very  time-consuming,  there  has  been 
work  done  on  minimizing  its  overhead  and  reusing  any 
schedules  produced  [3]. 
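
The inspector-executor split can be illustrated with a small, single-process simulation. In the sketch below, a block-distributed array is modeled as a list of per-rank lists. The inspector scans the indices a rank will access and builds a communication schedule; the executor then uses that schedule to gather the remote values. All names (owner_of, inspector, executor) and the block distribution are assumptions for illustration. This is not HPF runtime code, and a real executor would exchange messages rather than read another rank's list.

```python
BLOCK = 4  # elements per rank in a simple block distribution (assumed)


def owner_of(global_index):
    """Rank that owns a global index under the block distribution."""
    return global_index // BLOCK


def inspector(accessed, my_rank):
    """Run-time analysis: group the remote indices this rank needs by owner."""
    needed = {}
    for g in accessed:
        owner = owner_of(g)
        if owner != my_rank:
            needed.setdefault(owner, set()).add(g)
    # Sorting makes the schedule deterministic, so it can be stored and reused.
    return {owner: sorted(indices) for owner, indices in sorted(needed.items())}


def executor(schedule, distributed):
    """Carry out the schedule: one bulk fetch per owner, not one per access."""
    ghost = {}
    for owner, indices in schedule.items():
        block = distributed[owner]      # stands in for a message from `owner`
        for g in indices:
            ghost[g] = block[g % BLOCK]
    return ghost


# Three ranks; rank r holds the values 10*r + 0 .. 10*r + 3.
distributed = [[10 * r + i for i in range(BLOCK)] for r in range(3)]
accessed = [1, 9, 4, 9, 6, 2]           # irregular, with a repeated access
schedule = inspector(accessed, my_rank=0)
ghost = executor(schedule, distributed)

print(schedule)
print(ghost)
```

If the access pattern does not change between iterations, the schedule can be computed once and the executor called repeatedly, which is how the inspector's cost is amortized.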

For some applications, it will be best to use a programming
model that does not hide the heterogeneity of the underlying
resources and requires the application builder to
hand-code the application to the resources. There are, of
course, great benefits in not having to hand-code applications
to tolerate bandwidths and latencies. For these situations,
dealing with a heterogeneous infrastructure means addressing
the fundamental problems of (1) data locality and
(2) scheduling, where scheduling in this context means both
communication and execution scheduling which are, in fact,
interdependent. A clearly attractive approach is to require
sufficient concurrency in the application such that a coarse-grain,
data-driven model of execution can be used. The
initial challenge is to hide latency with concurrency while
keeping context switching overheads low. The next challenge
is to encapsulate this low-level performance management
into an easy-to-use package or component. If this is
possible, then other established techniques such as caching,
compression, estimation and speculative pre-fetching could
also be used.

Finally, we note that applications tend to have their own,
natural “problem architecture” and some, by their very nature,
are more tightly coupled than others. As soon as a
distributed implementation is considered, it imparts a three-dimensional
or spatial “density distribution” to the computation.
This density and the available bandwidth and latency
become part of the algorithmic complexity governing performance.
Some applications will have unavoidable spatial
constraints that will be best addressed by recasting the
problem and its solution in a more loosely coupled fashion.
The Barnes-Hut algorithm, for example, solves the
N-Body problem in less than O(n^2) complexity by representing
space with an octree such that from any given body,
groups of far away bodies can be represented as a point
source.
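
The point-source idea can be shown without building the octree. The sketch below compares the exact gravitational pull of a small, distant group of bodies with the pull of a single point mass placed at the group's center of mass. It covers only the far-field approximation step, in two dimensions, with made-up positions and masses and with G set to 1; it is not a Barnes-Hut implementation.

```python
import math


def pull(target, position, mass):
    """Acceleration on a body at `target` due to one point mass (G = 1)."""
    dx, dy = position[0] - target[0], position[1] - target[1]
    r = math.hypot(dx, dy)
    return (mass * dx / r**3, mass * dy / r**3)


def direct_sum(target, bodies):
    """Exact pull: one term per body in the group."""
    ax = ay = 0.0
    for position, mass in bodies:
        px, py = pull(target, position, mass)
        ax, ay = ax + px, ay + py
    return (ax, ay)


def center_of_mass(bodies):
    total = sum(mass for _, mass in bodies)
    cx = sum(pos[0] * mass for pos, mass in bodies) / total
    cy = sum(pos[1] * mass for pos, mass in bodies) / total
    return (cx, cy), total


def point_source(target, bodies):
    """Approximate pull: the whole group treated as one point mass."""
    com, total = center_of_mass(bodies)
    return pull(target, com, total)


# A tight group about 100 units away from the target (hypothetical data).
group = [((100.0, 0.0), 1.0), ((101.0, 1.0), 2.0),
         ((99.0, -1.0), 3.0), ((100.0, 2.0), 4.0)]
target = (0.0, 0.0)

exact = direct_sum(target, group)
approx = point_source(target, group)
rel_error = math.hypot(exact[0] - approx[0], exact[1] - approx[1]) / math.hypot(*exact)

print(f"relative error of the point-source approximation: {rel_error:.2e}")
```

The error shrinks as the group gets smaller relative to its distance, which is why far away groups can share one term while nearby bodies must still be summed individually.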

The  challenge  for  grids  and  heterogeneous  computing, 
however,  is  to  minimize  the  class  of  applications  that  have 
to  be  recast  by  developing  systems  and  runtimes  that  under¬ 
stand  the  “spatial  component”  of  an  application  and  can  act 
accordingly  to  provide  the  best  overall  performance  with  the 
available  communication  resources.  This  is  one  of  the  goals 
of  the  Grid  Forum’s  Advanced  Programming  Models  Work¬ 
ing  Group  (www.gridforum.org).  The  need  for  such  “spa¬ 
tial  component”  management  will  only  increase  as  systems 
like  long-latency  satellite  networks  and  low-power  mobile 
networks  come  online  with  high-performance  compute  sys¬ 
tems  such  as  hardware  multithreaded  processors  that  toler¬ 
ate  deep  memory  hierarchies. 
Abstract

The Portable Parallel/Distributed Debugger project at
the NASA Ames Research Center has built a debugger for
applications running on heterogeneous computational
grids. It employs a client-server architecture to simplify the
implementation, and its user interface has been designed to
provide process control and state examination functions on
computations with a large number of processes. The
debugger can find processes participating in distributed
computations even when those processes were not created
under debugger control. In addition to working in a computational
grid environment, these techniques also work on
other distributed memory jobs such as those initiated by
mpirun.

1.  Introduction 

While  tools  for  debugging  computationally  intensive 
programs  have  improved  substantially  in  the  last  few  years 
[5]  [22],  there  are  two  areas  where  further  improvement  is 
needed.  First,  existing  tools  do  not  cope  well  with  applica¬ 
tions  running  on  heterogeneous  computing  platforms.  Sec¬ 
ond,  they  do  not  provide  sufficiently  abstract  and  scalable 
operations  for  examining  and  controlling  execution. 

This  combination  of  inadequacies  is  particularly  felt  by 
programmers  building  applications  to  run  on  large-scale 
computational  grids,  such  as  NASA’s  Information  Power 
Grid  (IPG)  [12].  The  IPG  is  based  on  the  Globus  toolkit  [7] 
and  can  give  an  application  access  to  a  variety  of  comput¬ 
ing  resources  across  the  country.  Debugging  such  a  compu¬ 
tation  using  existing  techniques  is  at  best  very  tedious.  In 
the  worst  case,  it  may  not  be  possible. 

In order to provide a reasonable debugging system for
computational grid computations, the Portable Parallel/Distributed
Debugger (p2d2) project in the Numerical
Aerospace Simulation (NAS) Division of the NASA Ames
Research Center has taken its existing debugger and
extended its capabilities. The original goals of the project
in 1994 were to build a debugger that was portable
across a variety of target machines and whose user interface
scaled to a large number of processes.
(At that time we interpreted that to mean at least 256 processes.)
The result of that initial effort was a debugger [10]
that ran on a variety of Unix-based machines and could be
used on both MPI [15] and PVM [21] applications.

* This work was supported through NASA contract NAS 2-14303.

The screen dumps in this paper have been modified from their normal
screen appearance in order to aid reproducibility. The modifications
include color changes as well as a resizing of some components from their
defaults.

In this paper we report on the effort to enhance that
debugger to work in a computational grid environment. We
begin with a discussion of how p2d2's architecture accommodates
the debugging of heterogeneous computations. In
section 3 we look at user interface features that enhance
scalability. Following that we discuss how p2d2 meets the
requirements imposed by a computational grid environment.
In section 5 we examine how heterogeneity affects
the user interface components.

2.  An  architecture  to  support  heterogeneity 

Debuggers,  even  serial  ones,  are  inherently  nonportable. 
Their  basic  task  is  to  take  a  user  request  at  the  source  level, 
map  it  to  the  machine  level  where  it  can  be  performed,  and 
then  map  the  result  back  to  the  source  level.  To  accomplish 
this  they  rely  on  information  and  services  from  a  variety  of 
sources.  For  example: 

•  the  compiler  provides  source  line  and  symbol  map¬ 
ping  data, 

•  the  operating  system  provides  services  for  starting 
and  stopping  processes,  and 

•  the  computer  architecture  defines  a  trap  instruction 
that  can  be  used  for  implementing  breakpoints. 

The most successful portable debugger, gdb from the
Free Software Foundation [6], defines abstract interfaces
for many low-level functions in a debugger, such as reading
values in another process’s address space. The gdb
source distribution includes machine-specific implementations
for those functions, and at compile time it determines
which code needs to be present to build a debugger for a
given platform. One problem that gdb does not attempt to
solve is that of debugging heterogeneous computations,
where these portability issues must be solved in a way that
enables different target platforms to be available at the
same time.

FIGURE 1. The client-server architecture of p2d2.

In the p2d2 project we wanted to address the portability
issues at a higher level of abstraction than gdb did. We used
a client-server architecture to isolate the platform-dependent
code in a debugger server (see Figure 1). The server
defines a collection of C++ objects that would exist in a
debugging session, such as Process and Stack. The client
consists of those parts of the debugger that deal with
the distributed nature of the target computation and with
the user. It can be implemented in a highly portable fashion.
For example, if the client has a Process *p, it could
resume execution in it by invoking the operation
p->Continue().

The object collection is discussed in detail in a previous paper [11].

In the initial version of p2d2 we decided to build a
debugger server based on gdb. The main reason for this
was that our debugger would be easily portable to any platform
where a gdb implementation existed, thus saving us
a huge amount of implementation effort. In the gdb-based
implementation of p2d2 (see Figure 2) the remote server of
Figure 1 is replaced with an instance of gdb. The debugger
server is then an implementation of the C++ objects that
uses gdb commands to carry out any debugger
server requests. In effect, it maps operations on the C++
objects into gdb commands and then maps the gdb
response back to the object level.

To continue the example above, if the client invokes a
continue operation on a process with “p->Continue()”
the server sends a “cont” request to the gdb controlling
that process and marks the process as “running”. When that
process hits a breakpoint, gdb reports it to the debugger
server which analyzes the reason for stopping and updates
its own picture of the process state. In doing so, it may send
other requests to gdb, such as “where” to find out what the
runtime stack looks like. When it has completed its picture
of the process state, it notifies the client that the process has
stopped.

FIGURE 2. The gdb-based implementation of
p2d2.

3.  User  interface  basics

From  a  user’s  perspective,  a  debugger  has  two  primary 
functions: 

•  state examination, where the user can scrutinize
expression values, source code, run-time stack, and
other components of the current computational
state; and

•  process control, where the user is permitted to start
execution of the target computation and to describe
circumstances under which it should stop.

The challenge in a multiprocess debugger is to provide
these functions in a way that scales well to a large number
of processes. In particular, the challenge for state examination
is to provide both an abstract, top-level view of the
computation and information about a single process
that has the same level of detail that a serial debugger would
have. The challenge for process control is to provide a way
to propagate a single process control request, such as Continue,
to a collection of processes, thereby relieving the user
from the burden of directing processes individually.

To  address  the  state  examination  challenge,  p2d2 
defines  three  zooming  levels,  providing  a  varying  degree  of 
abstraction  versus  detail. 

•  A top-level view, called the process grid, provides a
programmable display showing a few bits of information
about each of the processes in the computation.

•  An intermediate-level view provides a line of text
summarizing the state of each process in a user-selected
set called the focus group.

•  A low-level view provides full information about a
single, user-selected focus process.

The focus group and the focus process are
selected in the process grid display. When the user changes one
of the selections, the display is updated to reflect the information
about the new focus. For example, if the user changes
the focus process, then all state examination displays that
are focus-process-sensitive, such as the source and stack
displays, will be updated.

To address the process control challenge, p2d2 uses the
notion of a control set, which is the collection of target processes
that are subject to process control requests. The user
has a variety of ways of setting membership in the control
set. The current membership of the control set is indicated
in the process grid display. When a process control operation
such as setting a breakpoint or continuing execution is
requested, it is forwarded to all processes in the control set.
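
The forwarding behavior of a control set can be sketched as follows. This is only an illustration of the idea; the ControlSet class, its method names, and the deliver callback are hypothetical and do not correspond to names in p2d2.

```python
class ControlSet:
    """Hypothetical container for the processes subject to control requests."""

    def __init__(self):
        self.members = set()

    def add(self, proc_id):
        self.members.add(proc_id)

    def remove(self, proc_id):
        self.members.discard(proc_id)

    def forward(self, request, deliver):
        # Send one request to every member; return how many were addressed.
        for proc_id in sorted(self.members):
            deliver(proc_id, request)
        return len(self.members)


log = []
cs = ControlSet()
for pid in (0, 1, 2, 3):
    cs.add(pid)
cs.remove(2)                      # the user drops process 2 from the set
count = cs.forward("continue", lambda pid, req: log.append((pid, req)))
```

A single Continue request thus reaches processes 0, 1, and 3, while process 2 is left untouched.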

FIGURE 3. The main window of p2d2.


The  main  window  of  p2d2  is  shown  in  Figure  3.  In  that 
example,  a  Globus  computation  of  32  processes  is  being 
debugged.  The  process  grid  shows  all  of  the  processes  in 
the  computation,  plus  the  globusrun  process  that  initiated 
the  job  (but  does  not  participate  in  the  computation).  The 
focus  group  displays  one  line  of  text  for  each  process  in  the 
selected  column  of  the  process  grid.  In  the  figure,  the  user 
has selected the fourth of the nine columns. The focus process
part of the display resembles a serial debugger on the
single  process  selected  in  the  process  grid  (the  one  in  row 
3,  column  6). 

Perhaps the most novel feature of the p2d2 user interface
is the programmability of the process grid described
earlier. This feature permits a quick scan of a large number
of processes to isolate a process behaving in an unexpected
manner. Such a process is a good candidate for closer scrutiny
as the focus process.


This customization is achieved by having the user specify
a list of predicates that should be tested in each process
and how a process should be depicted in the process grid if
a predicate holds for it. For example, in the default
view of the display, running processes are represented by
green squares and stopped processes by red ones.

P2d2  defines  a  collection  of  conditions  that  can  be 
tested.  These  include: 

•  the  process  is  running, 

•  the process is on some machine M,

•  an expression E evaluates to true in the process, and

•  the process is stopped in user function F1 and is calling
non-user function F2.

The customization feature is illustrated in Figure 4. In that
example, the user suspected that a computation was deadlocked.
She paused all of the processes and then requested
that the process grid show where each process was stopped.

FIGURE 4. Customizing the process grid display.


The debugger constructed the customization shown in the
“Custom Grid Display Editor” window. It shows running processes
with a green square; only the globusrun startup process
is in this category. Stopped processes are depicted with
an “x” if they are in mesh_update_bdry_asynch and
calling mpi_recv; they are depicted with an “o” if they are
in that same function but calling mpi_barrier. This feature
gives the user a quick way to find out what each process
is doing. In particular, in this example, the user was able to
focus on the four processes doing an mpi_recv and find
a communication pattern error.
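
The predicate list used in this example can be approximated by the sketch below, in which each process is drawn with the symbol of the first case whose predicate holds. The dictionary layout used to describe a process and the helper names grid_symbol and stopped_in are assumptions for illustration; they are not taken from p2d2.

```python
def grid_symbol(proc, cases, default="?"):
    """Return the symbol of the first case whose predicate holds for proc."""
    for predicate, symbol in cases:
        if predicate(proc):
            return symbol
    return default


def stopped_in(user_fn, callee):
    # Predicate: stopped in user function `user_fn`, calling `callee`.
    return lambda p: (not p["running"]
                      and p["user_fn"] == user_fn
                      and p["callee"] == callee)


cases = [
    (lambda p: p["running"], "green"),
    (stopped_in("mesh_update_bdry_asynch", "mpi_recv"), "x"),
    (stopped_in("mesh_update_bdry_asynch", "mpi_barrier"), "o"),
]

procs = [
    {"running": True, "user_fn": None, "callee": None},
    {"running": False, "user_fn": "mesh_update_bdry_asynch",
     "callee": "mpi_recv"},
    {"running": False, "user_fn": "mesh_update_bdry_asynch",
     "callee": "mpi_barrier"},
]
symbols = [grid_symbol(p, cases) for p in procs]
```

Testing the cases in order keeps the display unambiguous when more than one predicate could hold for a process.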

In addition to providing tools for abstracting state across
a collection of processes, p2d2 also provides various means
to  examine  specific  data  values  in  a  computation.  Scalar 
expressions  can  be  evaluated  on  each  process  in  the  control 
set  with  the  result  being  displayed  in  the  output  window 
(the  bottom  pane  in  Figure  3).  As  an  alternative,  a  scalar 
data  viewer  allows  the  persistent  display  of  scalar  values 
for  up  to  4  focus  processes  (Figure  5).  Data  values  there 
are  updated  each  time  a  breakpoint  is  hit  or  when  the  focus 
is  shifted  to  another  process. 

Array  data  can  be  examined  using  the  p2d2  array 
viewer.  It  displays  the  array  in  either  textual  or  graphical 
mode  for  each  of  the  focus  processes.  Figure  6  shows  an 
example  of  the  data  displayed  as  text.  As  in  the  scalar 
viewer,  the  values  are  updated  when  the  program  reaches  a 
breakpoint  or  when  the  focus  is  shifted  to  another  process. 


FIGURE  5.  Comparing  data  across  processes. 


4.  Handling  grid-based  computations 

In addition to state examination and process control features,
a successful debugger will need to automate the task
of finding and controlling all of the processes participating
in a distributed computation. The user should not be
required to filter through lists of processes running on a
large number of machines in order to determine which of
them belong to a job.

As with serial debuggers, there are two cases to consider
in  acquiring  initial  control  over  processes  to  be  debugged: 

1.  the  computation  was  initiated  from  the  debugger 
when  the  user  invoked  the  Run  command,  and 

2.  the  user  initiated  the  computation  outside  of  the 
debugger  and  then  requested  that  the  debugger 
“attach”  to  it. 

In order to handle case 1, the debugger needs to resolve a
conflict with the process starting mechanism (e.g., mpirun,
globusrun, pvmrun) that initiates the distributed computation.
The conflict comes about because both the debugger
and the process starter want to control the actual fork()
and exec() that start the individual processes. A customary
way to resolve this conflict is for the process starting


FIGURE  6.  Array  data  viewed  as  text 
(compare  with  graphical  view  in  Figure  11). 

FIGURE 7. After the user requests a globusrun.


mechanism to allow a user-supplied proxy program (sometimes
called a tasker) to perform the fork and exec. Both
pvmrun and globusrun permit the debugger to gain control
over process creation in this way.

To debug Globus jobs, p2d2 uses the tasking mechanism
provided by globusrun. If the debugger is going to be
used to initiate a Globus job, the user must include the
clause

(paradyn="P2D2_HOST P2D2_PORT p2d2 \
/u/p2d2/bin/gdbserver")

in the RSL script to be handed off to globusrun. This indicates
that /u/p2d2/bin/gdbserver should be used as a
tasker. When the user requests a Run, the following sequence
of events, depicted in Figure 7, happens.

1.  P2d2 invokes globusrun, changing the P2D2_HOST
and P2D2_PORT strings in the RSL script to the
machine name on which p2d2 is running and the
number of a tasker contact port that it created.

2.  When  globusrun  starts  the  tasker,  it  passes  it  the 
machine  name  and  port  number  that  p2d2  wrote  in 
the  RSL  script. 

3.  The  tasker  and  p2d2  then  establish  a  socket. 

4.  The  tasker  starts  the  target  executable  and  reports  the 
target’s  pid  on  the  socket.  The  target  sleeps. 

5.  P2d2  asks  the  tasker  to  start  gdb  and  to  forward  an 
attach  request  to  it. 

6.  Gdb  attaches  to  the  target  to  take  control. 
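
Step 1 of this sequence can be sketched as follows: the debugger creates a contact port and rewrites the two placeholder strings in the RSL clause. The function names, the single-line form of the clause, and the host name debughost.example are assumptions for illustration; the sketch binds to the loopback address only so that it runs anywhere.

```python
import socket


def open_contact_port():
    """Create a listening socket on an OS-chosen port (port 0)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    return server, server.getsockname()[1]


def fill_rsl(rsl_text, host, port):
    """Replace the placeholder strings with the debugger's host and port."""
    return (rsl_text.replace("P2D2_HOST", host)
                    .replace("P2D2_PORT", str(port)))


rsl = '(paradyn="P2D2_HOST P2D2_PORT p2d2 /u/p2d2/bin/gdbserver")'
server, port = open_contact_port()
filled = fill_rsl(rsl, "debughost.example", port)
server.close()
```

The tasker started by globusrun would then receive the substituted host and port and connect back to the debugger, as in steps 2 and 3.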


In order to handle case 2 above, where the user requests
that the debugger attach to an existing computation, the
debugger needs:

•  a list of the processes that are participating in a computation,
and

•  a mechanism for gaining control over them.

If a tasking mechanism exists, it can be used to meet these
needs. For example, if p2d2 is to be used to attach to an existing
Globus job, the job must have been started with the
“paradyn” option described previously. Then the following
steps (illustrated in Figure 8) take place.

1.  Globusrun creates a tasker process for each target
process in the computation.

2.  The resulting tasker processes will each create a port.

3.  The port contact information from all taskers is combined
in a single file in the file system.

4.  When  the  user  starts  up  p2d2  and  asks  for  it  to  attach 
to  the  Globus  processes,  the  debugger  will  retrieve 
the  tasker  port  contact  information  in  the  file. 

5.  P2d2  will  then  establish  sockets  with  the  taskers. 

6.  The  debugger  will  then  ask  the  tasker  to  start  up  a 
gdb  and  pass  an  attach  request  to  it. 

7.  Gdb  will  then  attach  to  the  target  process. 
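
Steps 3 and 4 amount to a small rendezvous through a shared file. The sketch below assumes a file format of one "host port" line per tasker; the format, the function names, and the node names are illustrative assumptions, since the paper does not specify them.

```python
import os
import tempfile


def append_contact(path, host, port):
    """Tasker side: append one 'host port' line to the shared contact file."""
    with open(path, "a") as f:
        f.write(f"{host} {port}\n")


def read_contacts(path):
    """Debugger side: return a list of (host, port) pairs from the file."""
    contacts = []
    with open(path) as f:
        for line in f:
            host, port = line.split()
            contacts.append((host, int(port)))
    return contacts


fd, contact_path = tempfile.mkstemp()      # stands in for the shared file
os.close(fd)
append_contact(contact_path, "node1", 5001)
append_contact(contact_path, "node2", 5002)
contacts = read_contacts(contact_path)
os.remove(contact_path)
```

Each pair read back by the debugger identifies one tasker to which a socket can then be established, as in step 5.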

Storing the tasker contact information in the file system
can be problematic. The machine where p2d2 runs may not
mount the same file system that the taskers do. In fact, the
taskers themselves may not share a common file system.
Under Globus, the right way for the taskers to get the contact
information to p2d2 is to use the Metacomputing
Directory Service (MDS). We are currently modifying our
tasker to use that approach.

In our discussions so far, we have relied on a tasking
mechanism at process startup. Unfortunately, the initial version
of MPI does not have such a feature, because process
creation was not part of the standard. To handle MPI jobs
when there is no tasking mechanism, p2d2 uses rsh to run a
copy of gdb on the machine where the target process exists.
There are two remaining needs:

•  a list of pairs [machine, pid] for each process in the
job, and

•  a way to keep a newly started MPI process from
executing code.

The second condition allows us to handle debugger-initiated
runs in an identical manner to runs initiated outside of the
debugger. To handle a Run request in this scenario, p2d2 invokes
mpirun, which starts the processes on the remote machine.
If we have a way to keep the newly started MPI
process from making progress, we can simply attach to it as
we do for runs initiated outside of the debugger.

We can address both of the needs above by using the
profiling mechanism of MPI and providing a specialized
version of MPI_Init(). The MPI_Init used by p2d2
does the following.

•  It calls PMPI_Init() to do the normal initialization
for MPI.

•  The process with rank 0 gets the machine name and
process ID for all processes. It writes that data in the
file system.

•  If the process was initiated from the debugger, it
goes into an infinite sleep loop.


When  the  debugger  attaches,  it  establishes  any  necessary 
breakpoints,  terminates  the  sleep  loop,  and  then  continues 
execution. 
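
The hold-and-release behavior can be simulated without MPI, as in the sketch below. A thread plays the role of the target process and the main program plays the role of the debugger; the names wrapped_init and hold are hypothetical, and the real wrapper is a version of MPI_Init that calls PMPI_Init, which this simulation only imitates.

```python
import threading
import time

hold = {"waiting": True}    # stands in for a variable the debugger can modify
events = []
initialized = threading.Event()


def wrapped_init(started_by_debugger):
    """Simulated stand-in for the specialized initialization routine."""
    events.append("normal initialization")      # the PMPI_Init() step
    initialized.set()
    while started_by_debugger and hold["waiting"]:
        time.sleep(0.01)                        # sleep loop until released
    events.append("user code runs")


target = threading.Thread(target=wrapped_init, args=(True,))
target.start()
initialized.wait(timeout=2)      # the "debugger" attaches after initialization
held = events == ["normal initialization"]
hold["waiting"] = False          # the debugger terminates the sleep loop
target.join(timeout=2)
```

Until the flag is cleared the target makes no progress past initialization, which is what lets the debugger set breakpoints before any user code runs.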

There are two minor limitations in the version of
MPI_Init used by p2d2:

•  it is not possible to debug the code that executes
before MPI_Init is called, and

•  the user must link the application with p2d2's version
of MPI_Init.

The  latter  condition  could  lead  to  a  conflict  if  other  libraries 
want  to  use  the  profiling  mechanism  of  MPI. 

While these limitations exist, in practice they restrict
p2d2's capabilities very little. Furthermore, we are hopeful
that an mpirun based on the process control operations in
MPI-2 [15] will provide a tasking mechanism that will
eliminate the restrictions altogether.

5.  Heterogeneity and the user interface

In adapting p2d2 to work in a heterogeneous computational
grid environment, we found two areas that needed
more work:

•  displaying what kind of machine and operating system
a process was running on, and

•  providing abstract, consistent views of data across
heterogeneous processors.

The first problem was relatively easy to solve. P2d2 extracts
system type information from its debugger servers and then
displays it in two different ways, as shown in Figure 9. First,
it puts system information in the focus group display. Second,
it defines a predicate “process is on operating system
S” so system information can be displayed in the process
grid. In the example shown in Figure 9, the grid view is programmed
so that processes running on IRIX are depicted as


[Figure 9 screenshot: Custom Grid Display Editor with cases “if process is on OS matching” mips-sgi-irix6.5, sparc-sun-solaris2.6, and i386-redhat-linux, followed by an “otherwise” case.]

FIGURE  9.  Support  for  heterogeneity  in  the  process  grid. 
a “V”, processes running on Solaris are indicated with an
“x”, and processes running under Linux show a third symbol. This
results in the process grid view as shown.

To address the second problem, that of providing consistent,
abstract representations of program state across
heterogeneous processes, we needed to make the existing
state examination tools more robust and to provide some
new ones as well. One of the problems we ran into when
first looking at heterogeneous computations concerned providing
automatic assistance for comparing the value of an
expression in processes on different architectures. Data
representation was not an issue because gdb provides the
expression’s value as text. Instead, the issue was that of
finding where in the process the evaluation should take
place.

Expression evaluation in p2d2 has always tried to make
sure that the user compares apples to apples. That is, when
evaluating an expression on more than one process, the
debugger attempts to use the same context in each of the
processes doing the evaluation. So, if a variable is being
evaluated in two processes, the evaluation will take place in
stack frames that are “similar”. This means that the
debugger needs to compare the runtime stacks of the non-focus
processes in order to determine which frame best
corresponds to the selected frame in the focus process. In a
homogeneous environment this is not too difficult. The
problem we needed to address in a heterogeneous environment
was that the runtime stacks looked somewhat different.
In particular, function names often changed slightly.
We addressed the problem by mapping function names to a
canonical form. Then stack comparison could be handled
as in the homogeneous case.
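
The idea of comparing stacks after canonicalizing function names can be sketched as below. The canonicalization rules shown (lowercasing and stripping trailing underscores, as commonly added to Fortran symbol names) and the matching criterion are assumptions for illustration; the paper does not state the rules p2d2 actually applies.

```python
def canonical(name):
    """Map a function name to a canonical form (assumed rules)."""
    return name.lower().rstrip("_")


def matching_frame(focus_stack, focus_index, other_stack):
    """Find the frame in other_stack corresponding to the focus frame.

    Stacks are lists of function names, outermost caller first. A frame
    matches when the canonical call path leading to it agrees with the
    call path of the selected focus frame. Returns the frame index, or
    None when no frame corresponds.
    """
    wanted = [canonical(f) for f in focus_stack[:focus_index + 1]]
    have = [canonical(f) for f in other_stack]
    if have[:len(wanted)] == wanted:
        return len(wanted) - 1
    return None


focus = ["MAIN__", "solve_", "update_"]
other = ["main", "SOLVE", "update", "mpi_recv"]
index = matching_frame(focus, 1, other)
```

With the names reduced to a common form, the second frame of the non-focus stack is identified as the counterpart of the selected frame even though the raw names differ.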

In order to increase the abstraction level of our data displays,
we wanted to address the issue of displaying data
from arrays that are conceptually distributed across multiple
processes. Thus, p2d2’s array viewer provides a mechanism
to give the user a global picture of a distributed array.
The local data contributions from each of the participating

FIGURE 10. A global view of distributed data.


processes are gathered and assembled into a global picture.
When gathering the data from different machine architectures,
we had to take into account inconsistencies of gdb
across different compilers. An example is the “whatis”
command. For a Fortran array declared as real a(10,5)
on a Linux platform using g77, this results in type =
real*4 (10,5). On an SGI Origin using the MIPSPro
compiler, it results in type = real*4 (5,10). In this
case, p2d2 addresses the differences by reversing dimension
lists on the SGIs.
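
A minimal sketch of this normalization is given below, using the two replies quoted above. The function name parse_whatis and the reverse_dims flag are hypothetical; how p2d2 decides which platforms need reversal is not described here, so the caller supplies that decision.

```python
import re


def parse_whatis(reply, reverse_dims=False):
    """Parse a reply such as 'type = real*4 (10,5)' into (type, dims).

    reverse_dims is set for platforms whose gdb reports the dimension
    list in the opposite order.
    """
    m = re.match(r"type\s*=\s*(\S+)\s*\(([\d,\s]+)\)", reply)
    if m is None:
        raise ValueError("unrecognized whatis reply: " + reply)
    dims = [int(d) for d in m.group(2).split(",")]
    if reverse_dims:
        dims.reverse()
    return m.group(1), dims


linux = parse_whatis("type = real*4 (10,5)")
sgi = parse_whatis("type = real*4 (5,10)", reverse_dims=True)
```

After normalization both platforms report the same element type and the same dimension list, so the gathered data can be interpreted uniformly.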

Figure 10 shows a global display of a 2-dimensional
slice of the 4-dimensional array ux at a breakpoint. The
array ux is distributed across 8 processes: 4 on SGI Origins, 2
on a Sun Solaris platform, and 2 on a
Linux PC cluster. The array elements that reside on the
focus process are highlighted. To make comparison simpler,
Figure 11 shows the local contribution from the focus
process.

In order to assemble the local contributions of a distributed
array into a global picture, information about how the
data is distributed is required. If the program has been parallelized
without the use of parallelization support tools,
p2d2 will prompt the user to provide distribution information
via a dialog box (Figure 12). At the moment only simple,
structured distribution types are supported.
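
For a simple structured distribution, assembling the global picture reduces to computing which global indices each process owns. The sketch below assumes a one-dimensional block distribution; that choice, and the function names block_bounds and assemble, are illustrative assumptions rather than a description of the distribution types p2d2 supports.

```python
def block_bounds(n, nprocs, rank):
    """Half-open global index range owned by `rank` in a block distribution."""
    base, extra = divmod(n, nprocs)
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi


def assemble(n, local_parts):
    """Place each process's local values into one global list of length n."""
    global_view = [None] * n
    for rank, values in enumerate(local_parts):
        lo, hi = block_bounds(n, len(local_parts), rank)
        global_view[lo:hi] = values
    return global_view


parts = [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]   # 10 elements on 3 processes
global_view = assemble(10, parts)
```

Highlighting the elements that reside on the focus process then only requires the index range returned by block_bounds for that process.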

In  cases  where  the  program  has  been  parallelized  using 
a  parallelization  support  tool,  it  is  often  possible  to  retrieve 
such information through the tool that has been used. Currently,
p2d2 supports the CAPTools [3] parallelization tool,
which  was  developed  at  the  University  of  Greenwich. 
CAPTools  generates  parallel  code  from  a  serial  program 
by  performing  extensive  dependence  analysis,  logically 
partitioning  the  data,  and  inserting  calls  to  communication 
routines.  The  analysis  results  gathered  during  this  process 
are stored in a database, which is then probed by the
debugger to retrieve the required distribution information
without user intervention.


FIGURE 11. Local array data.

FIGURE 12. Specifying a distribution.


Some of the information stored in the CAPTools database
is symbolic and has to be evaluated
by p2d2 for each processor at run time. For example,
the upper and lower loop bounds, which determine the
effectively used area in a local array, are stored symbolically.
These bounds vary with the number of processors
and are potentially different for each processor.
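
Evaluating symbolic bounds per processor can be sketched as follows. The expression syntax, the variable names n, nprocs, and rank, and the example bounds are all hypothetical; the format in which CAPTools stores symbolic information is not described in this paper.

```python
import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.FloorDiv: operator.floordiv}


def evaluate(expr, env):
    """Evaluate an integer expression over the names in env."""
    def walk(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.Name):
            return env[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval").body)


# Hypothetical symbolic bounds for a 100-element array split evenly.
low_expr = "rank * (n // nprocs) + 1"
high_expr = "(rank + 1) * (n // nprocs)"
bounds = [(evaluate(low_expr, {"n": 100, "nprocs": 4, "rank": r}),
           evaluate(high_expr, {"n": 100, "nprocs": 4, "rank": r}))
          for r in range(4)]
```

The same pair of symbolic expressions yields a different used range on each processor, which is why the evaluation has to be deferred until run time.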

6.  Related  work 

There are two commercially available distributed
debuggers of note. TotalView [5], from Etnus, is a third-party
debugger that runs on a number of high performance
computing platforms. It is currently not capable of debugging
heterogeneous computations. Furthermore, while it
can debug thousands of processes and threads, the user
interactions are at a fairly low level. Prism [22], from Sun
Microsystems, is derived from the Thinking Machines
product of the same name. It is not portable to systems
other than Sun. While its user interface led the way in scalability,
it, too, could be more abstract.

SGI’s Jessie [19] is a freely available, cross-platform
development environment that provides a debugger based
on gdb and a performance analysis tool based on gprof.
Like p2d2, Jessie is aimed at providing portability. When it
comes to debugging programs consisting of multiple processes,
Jessie is limited to what gdb supports. That means
that while it is possible to invoke several instances of gdb to
debug multiple processes on different machines, to our
knowledge Jessie, at this time, does not provide means to
control them in a convenient, scalable way.

Guard is a debugger developed at Griffith University,
Australia [1][2]. It provides the ability to debug programs
in a distributed and heterogeneous environment by allowing
control of execution of separate programs on different
machines. Like p2d2, it uses a client-server paradigm to
provide portability. A gdb-based debugger server runs on
each of the machines to control the processes. The debugger
servers communicate with the client via RPC. Guard
provides a command language for user interaction that
contains commands like “compare” and “assert” to compare
values between programs that are running on different
machines and were possibly written in different languages.
It also allows the comparison between parallel and
sequential versions of a program by providing language
constructs that enable the user to map a serial data structure
onto the equivalent parallel version.

The Distributed Array Query and Visualization (DAQV)
project [9] aims to provide a solution for the problem of
exposing distributed data structures to external tools. The
original work started as a Parallel Tools Consortium [16]
project and focused on HPF as a target language. Information
about the distributed array could be obtained via the
HPF compiler. In the second phase of the project (DAQV-II),
Fortran 90 and MPI became the primary implementation
targets [8]. As in p2d2, DAQV-II requests array distribution
information from the user if it cannot be obtained
otherwise.

The SPiDER debugging system [20] for HPF programs
uses the GDDT (Graphical Data Distribution Tool) [13] for
the display of distributed arrays. It doesn’t appear to support
viewing arrays distributed across a heterogeneous collection
of machines.

7.  Project  status  and  future  work 

The current p2d2 system has been demonstrated on several
target architectures and has been used to debug both
MPI and PVM applications. After the recent work to
accommodate Globus computations, it has been successfully
used to control 128 processes running on 3 different
SGI Origins on the IPG. It has also been used on heterogeneous
computations running under Globus (see Figure 9).

At the time of writing this paper, we have requested permission
from NASA to distribute p2d2 under an Open
Source copyright [17]. Those desiring up-to-date information
about the status of that distribution are requested to
consult the p2d2 web site [18].

In the near future, we will start using the Metacomputing
Directory Service (MDS) in Globus to record information
about jobs started outside the debugger. This will
enable us to attach to Globus computations without relying
on the target systems sharing a file system with the debugger
host.

Further in the future we may adapt p2d2 to work with
Legion [14] and Condor [4] if there is sufficient user
demand. We also plan to enhance p2d2 to find differences
between serial and distributed versions of the same code.
This could be particularly useful when computer-aided
parallelization tools such as CAPTools are used to perform the
transformation.

8.  Conclusions 

In this paper we have described a debugger for heterogeneous,
distributed programs. We found that a client-server
model greatly simplifies the implementation of a
debugger. The debugger’s user interface has been designed
to provide a simple collective mechanism for process control,
as well as multiple levels of zooming for state examination.
These features facilitate the debugging of a
computation containing a large number of processes. We
also described several approaches for finding processes
participating in a distributed computation and how those
techniques could be used in a computational grid environment.

Abstract

It is often the case in Heterogeneous Computing (HC)
systems that an application requires multiple resources of
different types to be allocated simultaneously. In general,
this problem is the resource co-allocation problem. In this
paper, we develop a general framework for mapping a collection
of applications with resource co-allocation requirements.
In our framework, application tasks have two types
of constraints to be satisfied: precedence constraints and resource
sharing constraints. We use a graph theoretic framework
to capture these constraints. A Directed Acyclic Graph
is used to represent precedence constraints of tasks within
an application and a Compatibility Graph is used to represent
resource sharing constraints among tasks of applications.
Both these graphs are used to find maximal independent
sets of tasks that can be executed concurrently.

The objective of the mapping is to minimize the overall
schedule length for a given set of applications. We develop
heuristic algorithms to solve the mapping problem with resource
co-allocation constraints. We also provide a two-phase
algorithm that can be used for run-time adaptation.
We conducted extensive simulation experiments to evaluate
the performance of our heuristic algorithms. Simulation results
for our algorithms show a performance improvement
of 10% to 30% over a baseline list scheduling algorithm,
which considers only the precedence constraints and allocates
tasks from the resulting order. This paper demonstrates
the importance of considering the co-allocation requirements
when mapping applications in heterogeneous
computing environments, including grid environments.

1.  Introduction

In Heterogeneous Computing (HC) systems [8, 13, 20,
25, 26], a diverse set of resources is used in a coordinated
and effective way to solve computationally challenging applications.
Such systems are also called metacomputing systems
[29] or computational grids [10]. In general, such HC
systems have compute resources with different capabilities,
input/output devices, data repositories, and other resources,
all interconnected by heterogeneous local and wide area networks.
A major challenge in using HC systems is to effectively
use all the available resources.

Mapping applications in HC systems is a well-researched
problem in the literature. The mapping problem is defined
as the problem of assigning application tasks to suitable resources
(matching problem) and ordering task executions
in time (scheduling problem) to optimize a specific objective
function. Many algorithms exist for mapping applications
in HC systems (for a detailed classification see [4]).
For applications consisting of several tasks and represented
by Directed Acyclic Graphs (DAGs), many static and dynamic
mapping algorithms have been proposed. Dynamic
algorithms include the Hybrid Remapper [23], the Generational
algorithm [12], as well as others [1, 18, 21]. Several
static algorithms for mapping application DAGs in HC systems
are described in [19, 24, 27, 32]. Most of the previous
algorithms focus on compute resources only.

In our earlier work [2], we developed a unified resource
scheduling framework for HC systems which supports multiple
resource requirements, advance reservation, and data
replication. Each application was assumed to consist of several
tasks and was represented by a DAG. A task’s input data
can be data items from its predecessors and/or data sets from
data repositories. Input data sets can be accessed from one
or more data repositories. Sources of input data and the execution
times of the tasks on various machines along with
their availability were considered simultaneously to minimize
the overall completion time. Although we considered
multiple resource requirements in [2], tasks were not required
to access different types of resources simultaneously.

In this paper, we consider the problem of mapping a set of
applications in an HC system where application tasks require
concurrent access to multiple resources of different types.
In general, this problem is the resource co-allocation problem.
For example, an interactive data analysis application
may require simultaneous access to a storage system holding
a copy of the data, a supercomputer for analysis, network
elements for data transfer, and a display device for interaction
[11]. For such applications, co-allocation of all required
resources is necessary. A special case of this problem where
a single application requires concurrent access to a set of resources
in a computational grid has been considered in [5].

In  this  paper,  we  develop  a  general  framework  for  map¬ 
ping  with  resource  co-allocation  in  HC  systems.  The  frame¬ 
work  defines  the  system  and  application  models  and  formu¬ 
lates  the  co-allocation  problem.  Two  graphs  are  used  to  rep¬ 
resent  applications:  a  Directed  Acyclic  Graph  (DAG)  and 
a Compatibility Graph (defined in Section 3.4). The DAG
representation is given initially and stays unchanged throughout
the mapping process, while the compatibility graph is
updated  during  the  mapping  process.  In  classical  mapping 
problems,  only  DAGs  are  used  to  represent  the  precedence 
constraints  among  tasks.  In  this  paper,  the  co-allocation  re¬ 
quirements  add  another  type  of  constraint  among  the  tasks: 
the  resource  sharing  constraint  which  is  captured  in  the  com¬ 
patibility  graph.  Tasks  that  share  one  or  more  resources 
cannot  be  executed  concurrently  due  to  the  resource  shar¬ 
ing  constraints  even  if  they  have  no  precedence  constraints 
among  them.  Known  mapping  algorithms  for  the  classi¬ 
cal  DAG  scheduling  problem  cannot  be  directly  used  for 
the  above  problem  since  they  only  consider  the  precedence 
constraints.  In  this  paper,  we  develop  heuristic  algorithms 
that  can  be  used  with  different  allocation  techniques  to  ef¬ 
ficiently  solve  the  co-allocation  problem  defined  by  our 
framework. 

In  our  approach,  multiple  DAGs  of  different  applications 
are  combined  into  a  single  DAG.  All  tasks  that  have  satisfied 
the  precedence  constraints  are  ready  for  allocation  provided 
they  have  no  resource  sharing  constraints.  Using  the  com¬ 
patibility  graph,  we  will  select  tasks  that  can  be  executed 
concurrently.  This  is  achieved  by  finding  maximal  indepen¬ 
dent  sets  in  the  compatibility  graph. 

Our  research  is  part  of  the  MSHN  project  [16],  which 
is  a  collaborative  effort  between  DoD  (Naval  Postgraduate 
School),  academia  (NPS,  USC,  Purdue  University),  and  in¬ 
dustry  (NOEMIX).  MSHN  (Management  System  for  Het¬ 
erogeneous  Networks)  is  designing  and  implementing  a  Re¬ 
source  Management  System  (RMS)  for  distributed  hetero¬ 
geneous  and  shared  environments.  MSHN  assumes  hetero¬ 


geneity  in  resources,  processes,  and  QoS  requirements.  Pro¬ 
cesses  may  have  different  priorities,  deadlines,  and  com¬ 
pute  characteristics.  The  goal  is  to  schedule  shared  re¬ 
sources  among  individual  applications  so  that  their  Quality 
of  Service  (QoS)  requirements  are  satisfied.  MSHN  sup¬ 
ports  adaptive  applications  that  can  exist  in  several  different 
versions.  These  versions  may  differ  in  the  precision  of  com¬ 
putation  or  input  data,  and  therefore  have  different  values  to 
a  user.  Unlike  other  HC  and  grid  projects,  MSHN  seeks  to 
determine how to meet the QoS requirements of multiple
applications simultaneously.

The rest of this paper is organized as follows. In the next
section, we give the definition of the co-allocation problem and
summarize  some  related  work.  The  problem  framework  is 
defined  in  Section  3.  In  Section  4,  we  give  the  outline  of 
our  approach  to  solve  the  co-allocation  problem  using  our 
framework.  Experimental  results  are  given  in  Section  5.  Fi¬ 
nally,  Section  6  gives  the  conclusions  and  future  research  di¬ 
rections. 

2. The Co-Allocation Problem

The  co-allocation  problem  can  be  defined  as  the  problem 
of  simultaneously  allocating  multiple  resources  of  different 
types  to  applications  in  order  to  meet  specific  performance 
requirements. The need for co-allocation is a common
characteristic of applications running in HC environments (as
well as computational grids). For example, an application
may require a data repository, an HPC platform, multiple
display devices, and communication links all to be allocated
simultaneously.

A  version  of  resource  co-allocation  has  been  introduced 
in  the  high-performance  distributed  computing  community 
by  the  Globus  project  [5].  The  co-allocation  problem  is  de¬ 
fined  as  the  provision  of  allocation,  configuration,  and  mon¬ 
itoring/control  functions  for  the  resource  ensemble  required 
by  a  single  application  [5].  The  Globus  tool-kit  provides  a 
flexible  set  of  co-allocation  mechanisms  that  can  be  used  to 
construct  application-specific  co-allocation  strategies.  Only 
compute  resources  are  considered  in  the  Globus  project  at 
this  time,  to  synchronize  the  start  of  complex  applications 
at  multiple  sites. 

The  notion  of  co-allocation  was  also  considered  in  the 
Legion  project  [22].  In  the  Legion  project,  an  Enactor  pro¬ 
vides  a  mechanism  to  co-allocate  compute  and  storage  re¬ 
sources  (hosts  and  vaults)  to  a  single  application.  The  co¬ 
allocation  mechanism  is  based  on  advance  resource  reserva¬ 
tion. 

In  [5]  and  [22],  the  focus  is  on  implementation  issues  of 
the  co-allocation  process.  Algorithms  for  efficient  mapping 
with  co-allocation  requirements  are  not  considered.  Also, 
both of the above projects focus on executing a single
application. The problem becomes challenging when a number
of applications share resources.

In this paper, we study the co-allocation problem in the
context  of  mapping  a  set  of  applications  where  each  applica¬ 
tion  is  represented  by  a  DAG.  We  consider  conflicts  among 
tasks  caused  by  precedence  constraints  as  well  as  due  to  re¬ 
source  sharing.  The  objective  is  to  minimize  the  overall 
schedule  length  for  a  set  of  applications.  One  of  our  main 
contributions  in  this  paper  is  the  formulation  of  the  map¬ 
ping  problem  in  the  presence  of  co-allocation  requirements 
for  multiple  applications.  To  the  best  of  our  knowledge, 
this  work  is  the  first  step  towards  a  general  framework  for 
mapping  applications  with  resource  co-allocation  in  HC  sys¬ 
tems. 

3.  The  Framework 
3.1. System Model

We consider a heterogeneous computing system with
m compute resources (machines), M = {m1, m2, ..., mm},
and a set of r resources, R = {r1, r2, ..., rr}. Compute
resources can be HPC platforms, workstations, personal
computers, etc. Resource rk ∈ R can be a data repository,
an input/output device, etc. We assume that only one task
can use any resource (compute or non-compute)
at any given time. Resources are interconnected by
heterogeneous communication links. Communication costs
are given by two matrices, MM_comm and RM_comm,
where MM_comm gives the communication cost for
transferring a byte between machines and RM_comm gives the
communication cost for transferring a byte between the
resources in R and the machines.

We assume that an estimate of the computation time of
a given task ti on machine mj is available at compile
time. These estimated computation times are given in
an Estimated Computation Time (ECT) matrix. Thus,
ECT(ti, mj) gives the estimated computation time for task
ti on machine mj. If task ti cannot be executed on machine
mj, then ECT(ti, mj) is set to infinity.

MA(mj) gives the earliest time when machine mj is
available and RA(rk) gives the earliest time when resource
rk is available. As the mapping proceeds, the earliest time
when a resource (mj or rk) is available is calculated as the
finish time of the last task assigned to this resource.

3.2.  Application  Model 

In this HC system, a set of N applications,
A = {A1, ..., AN}, compete for system resources. Each
submitted application consists of several tasks and is
modeled by a DAG, where the nodes represent computational
requirements and the edges represent both precedence
constraints and communication requirements. Figure 1 shows
an example of two application DAGs. We assume that the
whole set of applications to be mapped is known a priori
(static applications). The problem is to execute these N
applications as efficiently as possible. Our approach is
to combine all submitted application DAGs into a single
DAG, G = (T, E), where T represents the set of tasks to
be executed from all applications, T = {t1, t2, ..., tn}, and
E represents the data dependencies and communication
between tasks. Edge eij indicates that there is communication
from task ti to tj and its weight denotes the amount
of communication. G is constructed by connecting the root
nodes (tasks) of all applications to a hypothetical zero-cost
entry node with zero-weight edges.

Figure 1. An example of two application DAGs (Application 1 and Application 2)
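The construction of the combined DAG can be sketched as follows. This is a minimal illustration, not the authors' implementation: the representation of an application as a (tasks, edges) pair, the name `combine_dags`, the entry node name `"entry"`, and the assumption that task names are unique across applications are all hypothetical.

```python
ENTRY = "entry"  # hypothetical name for the zero-cost entry node


def combine_dags(apps):
    """Merge application DAGs into a single DAG G = (T, E).

    apps: list of (tasks, edges) pairs; tasks is a list of task names
    (assumed unique across applications) and edges maps (ti, tj) to the
    amount of communication from ti to tj.
    """
    tasks, edges = [ENTRY], {}
    for app_tasks, app_edges in apps:
        tasks.extend(app_tasks)
        edges.update(app_edges)
        has_pred = {dst for (_, dst) in app_edges}
        for t in app_tasks:
            if t not in has_pred:        # t is a root node of its application
                edges[(ENTRY, t)] = 0    # zero-weight edge from the entry node
    return tasks, edges


# Two small, made-up applications.
app1 = (["t1", "t2", "t3"], {("t1", "t2"): 10, ("t1", "t3"): 4})
app2 = (["t4", "t5"], {("t4", "t5"): 7})
T, E = combine_dags([app1, app2])
```

Only the roots t1 and t4 receive an edge from the entry node; the precedence edges of each application are carried over unchanged.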

We assume that each task ti needs concurrent access to
a set of resources: one compute resource mj and a number
of additional resources as specified by the set R(ti), where
R(ti) ⊆ R. The amount of data to be transferred between
ti and rk, where rk ∈ R(ti), is given by DATA(ti, rk). A
task ti cannot start execution until all its required resources
are available to the task. All required resources will be
allocated to the task during its execution. These resources will
be available after the task completes its execution. We
assume that all required resources are acquired at the same
time (atomic transaction).

We say that task ti and task tj are incompatible if and
only if R(ti) ∩ R(tj) ≠ ∅. Incompatible tasks cannot
be executed concurrently even if they have no precedence
constraints among them. Therefore, in our framework, tasks
may be unable to run concurrently for either of the following
reasons: (1) precedence constraints, or (2) resource sharing
constraints.

The execution time of task ti on machine mj,
Exec(ti, mj), depends on the computation time of ti
on mj and data transfer times between mj and all resources
which ti needs to access during its execution. For example,
for systems that assume computation and communication
cannot be overlapped, Exec(ti, mj) can be defined as

$$Exec(t_i, m_j) = ECT(t_i, m_j) + \sum_{r_k \in R(t_i)} \big( DATA(t_i, r_k) \times RM\_comm(r_k, m_j) \big)$$

where the last term gives the total time to transfer any
required data between machine mj and every resource
rk ∈ R(ti). Exec(ti, mj) can also be defined in different
ways to consider the overlapping between computation and
communication as well as other communication models.
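As a worked illustration of the non-overlapped definition of Exec(ti, mj), the sketch below evaluates it for one task. The dictionaries `ECT`, `DATA`, `RM_COMM`, and `NEED`, the function name `exec_time`, and all numeric values are made up for the example; only the formula itself follows the definition above.

```python
import math


def exec_time(task, machine, ect, data, rm_comm, need):
    """Exec(ti, mj) = ECT(ti, mj) + sum over rk in R(ti) of DATA(ti, rk) * RM_comm(rk, mj)."""
    if math.isinf(ect[task, machine]):
        return math.inf                      # ti cannot be executed on mj
    transfer = sum(data[task, r] * rm_comm[r, machine] for r in need[task])
    return ect[task, machine] + transfer


# Hypothetical inputs: one task, three machines, two non-compute resources.
NEED = {"t1": {"r1", "r2"}}                                  # R(t1)
ECT = {("t1", "m1"): 20.0, ("t1", "m2"): 30.0, ("t1", "m3"): math.inf}
DATA = {("t1", "r1"): 100, ("t1", "r2"): 50}                 # bytes
RM_COMM = {                                                  # time per byte
    ("r1", "m1"): 0.5, ("r2", "m1"): 0.25,
    ("r1", "m2"): 0.25, ("r2", "m2"): 0.5,
    ("r1", "m3"): 0.5, ("r2", "m3"): 0.5,
}

on_m1 = exec_time("t1", "m1", ECT, DATA, RM_COMM, NEED)  # 20 + 100*0.5 + 50*0.25
on_m2 = exec_time("t1", "m2", ECT, DATA, RM_COMM, NEED)  # 30 + 100*0.25 + 50*0.5
on_m3 = exec_time("t1", "m3", ECT, DATA, RM_COMM, NEED)  # ECT is infinite
```

With these values the machine with the smaller ECT (m1) ends up with the larger total execution time, because its links to the required resources are slower.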

The average execution time of task ti is defined as

$$Exec(t_i) = \sum_{j=1}^{m} Exec(t_i, m_j) / m$$

ST(ti, mj) and FT(ti, mj) are the earliest start time and
the earliest finish time of task ti on machine mj, respectively,
if ti were to be mapped on mj. ST(ti, mj) is defined as

$$ST(t_i, m_j) = \max\{ MA(m_j),\ Data\_Pred(t_i, m_j) \}$$

where Data_Pred(ti, mj) is the time when task ti receives
all the needed data from all tasks in its predecessor set,
Pre(ti), if ti is mapped onto machine mj. FT(ti, mj) is
defined as

$$FT(t_i, m_j) = ST(t_i, m_j) + Exec(t_i, m_j)$$
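The start and finish time definitions can be illustrated with a short sketch. The text does not spell out how Data_Pred is computed, so the sketch assumes that each predecessor's data arrives at its finish time plus the edge weight multiplied by the MM_comm cost between the two machines, and that the cost from a machine to itself is zero. All names and numbers are hypothetical.

```python
def data_pred(task, machine, preds, finish, placed_on, volume, mm_comm):
    """Time at which `task`, if mapped on `machine`, has received all predecessor data.

    Assumption: data from predecessor p arrives at
    finish[p] + volume[p, task] * mm_comm[placed_on[p], machine].
    """
    arrivals = [
        finish[p] + volume[p, task] * mm_comm[placed_on[p], machine]
        for p in preds[task]
    ]
    return max(arrivals, default=0.0)


def start_time(task, machine, ma, **ctx):
    """ST(ti, mj) = max{MA(mj), Data_Pred(ti, mj)}."""
    return max(ma[machine], data_pred(task, machine, **ctx))


def finish_time(task, machine, ma, exec_t, **ctx):
    """FT(ti, mj) = ST(ti, mj) + Exec(ti, mj)."""
    return start_time(task, machine, ma, **ctx) + exec_t[task, machine]


# Hypothetical state: t1 and t2 are already mapped, t3 depends on both.
ctx = dict(
    preds={"t3": ["t1", "t2"]},
    finish={"t1": 10.0, "t2": 12.0},
    placed_on={"t1": "m1", "t2": "m2"},
    volume={("t1", "t3"): 4, ("t2", "t3"): 2},
    mm_comm={("m1", "m1"): 0.0, ("m1", "m2"): 1.5,
             ("m2", "m1"): 1.5, ("m2", "m2"): 0.0},
)
MA = {"m1": 11.0, "m2": 12.0}
EXEC_T = {("t3", "m1"): 5.0, ("t3", "m2"): 3.0}

ft_m1 = finish_time("t3", "m1", MA, EXEC_T, **ctx)  # ST = max(11, 15) = 15
ft_m2 = finish_time("t3", "m2", MA, EXEC_T, **ctx)  # ST = max(12, 16) = 16
```

Here m2 gives the earlier finish time even though t3 starts later on it, which is why the mapping compares FT rather than ST.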


3.3.  Mapping  Objective 

The objective in our framework is to determine
an assignment (matching) of tasks to compute resources and
schedule their executions based on all resource requirements
such that the overall schedule length (or makespan) of all
submitted applications is minimized while satisfying all

1. Application-specified precedence constraints and

2. Implied resource sharing constraints.

Thus, we can define our objective function as

$$\text{Minimize} \left\{ \max_{i=1}^{N} \left[ FinishTime(A_i) \right] \right\},$$

where FinishTime(Ai) is the completion time of
application Ai. Note that the resource sharing constraint is a
dynamic constraint: it depends on the tasks ready to be allocated
and their resource requirements.


Task    Resource Requirements
V1      [illegible]
V2      [illegible]
V3      [illegible]
V4      r1, r4
V5      [illegible], r5, r6
V6      r6

Table 1. An example showing 6 tasks and their
resource requirements


Figure  2.  The  compatibility  graph  for  the  tasks 
shown  in  Table  1 


Task    Execution Time
V1      5
V2      6
V3      2
V4      4
V5      1
V6      3

Table 2. Execution times for the tasks in Figure 2

3.4. Compatibility Graph

To capture the implied resource sharing constraints
among tasks that may belong to the same or different
applications, we use the compatibility graph, g = (V, E), where
vertex vi denotes task ti and edge eij exists if and only if ti
and tj are incompatible. Recall that task ti and task tj are
incompatible if and only if R(ti) ∩ R(tj) ≠ ∅. An independent
set [6] is a set of vertices of g such that no two vertices
of the set are adjacent. An independent set is called a maximal
independent set if there is no other independent set of g
that contains it. A maximal independent set with the largest
number of vertices among all maximal independent sets is
called a maximum independent set [6]. The maximum
independent set problem is NP-complete [15]. In our model, a
maximal independent set of g represents a maximal set of
tasks that can be executed concurrently if there are no
precedence constraints among them.

As an example, consider a set of 6 independent tasks.
Each task needs concurrent access to a set of resources as
specified in Table 1. The compatibility graph g for this
example is shown in Figure 2. The maximal independent sets
of g are {V1,V5}, {V2,V5}, {V1,V3,V6}, {V2,V4,V6}, and
{V3,V4,V6}. The last three sets are maximum independent
sets.
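The definitions of this section can be checked on a small instance. The sketch below builds a compatibility graph from per-task resource sets and enumerates its maximal independent sets by brute force, which is exponential and only suitable for tiny examples. The resource sets in `NEED` are hypothetical: they were chosen so that the resulting graph yields the five maximal independent sets listed above, and they are not claimed to be the entries of Table 1.

```python
from itertools import combinations

# Hypothetical resource requirements R(ti), assumed for illustration only.
NEED = {
    "V1": {"r1", "r2"}, "V2": {"r2", "r3"}, "V3": {"r3", "r5"},
    "V4": {"r1", "r4"}, "V5": {"r4", "r5", "r6"}, "V6": {"r6"},
}


def compatibility_graph(need):
    """Adjacency sets of g: ti and tj are adjacent iff R(ti) and R(tj) intersect."""
    adj = {t: set() for t in need}
    for a, b in combinations(need, 2):
        if need[a] & need[b]:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def is_independent(vertices, adj):
    return all(b not in adj[a] for a, b in combinations(vertices, 2))


def maximal_independent_sets(adj):
    """Brute-force enumeration, largest sets first.

    An independent set is maximal iff it is not a proper subset of an
    already found (larger) maximal independent set.
    """
    found = []
    for size in range(len(adj), 0, -1):
        for cand in combinations(sorted(adj), size):
            s = set(cand)
            if is_independent(cand, adj) and not any(s < f for f in found):
                found.append(s)
    return found


g = compatibility_graph(NEED)
mis = maximal_independent_sets(g)
maximum_size = max(len(s) for s in mis)
```

Under these assumed resource sets the enumeration returns three sets of size 3 (the maximum independent sets) and two maximal sets of size 2.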

4.  Our  Solution 

In the classical DAG scheduling problem, application DAGs
are partitioned into levels such that each level contains
independent tasks, i.e., there are no data dependencies among the
tasks in the same level. Therefore, all tasks in the same level
can be executed concurrently. In our framework,
incompatible tasks in the same level cannot be executed
concurrently due to resource sharing constraints. Therefore,
mapping algorithms for the classical DAG scheduling problem
(e.g., [1, 30, 18, 23, 9, 31, 32]) cannot be directly used for our
problem.

In  this  section,  we  develop  a  static  co-allocation  algo¬ 
rithm  using  the  framework  defined  in  Section  3.  The  algo¬ 
rithm  can  be  used  with  different  maximal  independent  set  se¬ 
lection  strategies  and  different  allocation  heuristics  to  solve 
the  mapping  problem  with  co-allocation  requirements.  Sev¬ 
eral  strategies  for  selecting  maximal  independent  sets  and 
several  allocation  heuristics  are  given  in  this  section.  Also, 
we  provide  a  two-phase  algorithm  that  performs  run-time 
adaptation. 

4.1.  The  Co-Allocation  Algorithm 

Pseudocode for our co-allocation algorithm is shown in
Figure 3. Given a set of applications and resource
requirements of tasks, we first find tasks that have satisfied
precedence constraints and then select maximal independent sets
among these for allocation. The compatibility graph is used
to find maximal independent sets. Since the maximum
independent set problem is NP-complete [15], our approach for
selecting a maximal independent set is based on first
choosing a critical node vc, and then finding a maximal
independent set that contains vc. Different strategies for selecting
critical nodes are given in Section 4.2.

To ensure precedence constraints are satisfied, we
combine all submitted applications into a single DAG, G, by
using zero-weight edges to connect the root nodes (tasks) of all
applications to a hypothetical zero-cost entry node. Then we
partition the combined DAG into l levels such that level 0
contains the entry node and level 1 contains all tasks that do
not have any predecessors in the submitted DAGs. All tasks
in level l have no successors. For each task ti in level k, all
of its predecessors are in levels 0 to k - 1, and at least one
of them is in level k - 1.
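The level partitioning described above amounts to assigning each task one plus the highest level among its predecessors. A minimal sketch follows, assuming the combined DAG is given as a task list and a collection of (source, destination) edges; the function name `partition_levels` and the example DAG are hypothetical.

```python
def partition_levels(tasks, edges, entry="entry"):
    """Return a list of levels; level 0 holds the entry node.

    A task is placed in level k when its deepest predecessor is in level k - 1.
    """
    preds = {t: [] for t in tasks}
    for src, dst in edges:
        preds[dst].append(src)
    level = {}

    def level_of(t):
        if t not in level:
            if t == entry:
                level[t] = 0
            else:
                level[t] = 1 + max((level_of(p) for p in preds[t]), default=0)
        return level[t]

    levels = {}
    for t in tasks:
        levels.setdefault(level_of(t), []).append(t)
    return [levels[k] for k in sorted(levels)]


# Made-up combined DAG: roots t1 and t4 hang off the entry node.
TASKS = ["entry", "t1", "t2", "t3", "t4", "t5"]
EDGES = [("entry", "t1"), ("entry", "t4"),
         ("t1", "t2"), ("t1", "t3"), ("t2", "t3"), ("t4", "t5")]
LEVELS = partition_levels(TASKS, EDGES)
```

Task t3 has predecessors in levels 1 and 2, so it lands in level 3, which matches the requirement that at least one predecessor is in the level immediately before.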

Let RDY be the set of tasks that have no precedence
constraints among them and that are ready for allocation. A task
is ready for allocation if for each predecessor all required
resources have been allocated. Let W be the set of ready tasks
that are waiting for allocation, and ALLOCATED be the
set of allocated tasks. After executing the algorithm, the list
SCHEDULE will give the resulting scheduling order of
tasks and the variable length will give the resulting
schedule length.

In steps 1 and 2 of the algorithm, we combine all
submitted applications into a single DAG, G, and partition G into
l levels. Then the algorithm proceeds level-by-level as
follows. For each level of G, we construct the compatibility
graph g for all tasks in this level (step 6). g is used to find
maximal independent sets of tasks that can be executed
concurrently. The first maximal independent set of tasks to be
allocated is selected in steps 7-8, where a critical node vc is
chosen in step 7 and a maximal independent set that contains
vc is determined in step 8.

In step 10, all tasks in the selected maximal independent
set are allocated to their required resources. For the
allocation, we first find the scheduling order of the tasks.
Several heuristics are given in Section 4.3. Then, we use
this scheduling order to assign a compute resource mj to
each task ti in order to minimize its finish time FT(ti, mj).
Availability times (MA(mj) and RA(rk)) of all resources
required by task ti are updated based on FT(ti, mj).

In steps 12-16, a new set of maximal independent tasks
among all waiting tasks is selected to be allocated at the
next allocation event. The next allocation event is
calculated as the earliest finish time, FT(ti, mj), among all
allocated tasks. An allocated task vx with the earliest
finish time is identified in step 12 and then removed from the
ALLOCATED set. Initially (step 14), the set of
candidate tasks that can be allocated next, C, contains all waiting
tasks. The candidate set, C, is updated in step 15 by
removing all tasks that are incompatible with any allocated task.
Then g is used to find a maximal independent set of tasks
from C. The algorithm repeats steps 10-16 until all tasks in
this level have been allocated.

Inputs: application DAGs, estimated computation and communication costs, and resource requirements of tasks
Outputs: the scheduling order of tasks (SCHEDULE) and the schedule length (makespan) (length) based on the given
inputs

Begin
1.   Combine all submitted application DAGs into a single DAG (G)
2.   Do level partitioning of G   /* tasks in each level have no precedence constraints */
3.   Let SCHEDULE = ∅ and length = 0
4.   For level 1 to l do
5.       Initialize W to include all tasks in the current level and let ALLOCATED = ∅
6.       Construct the compatibility graph g for all tasks in the current level
7.       Pick a critical node vc from W   /* several strategies can be used for critical node selection */
8.       Find a maximal independent set of tasks S from W such that vc ∈ S   /* g is used to find the maximal independent set */
9.       While W is not empty do
10.          Allocate all tasks in S to their required resources by doing the following two steps:
10a.             Find the scheduling order of the tasks and add them to SCHEDULE   /* different heuristics can be used */
10b.             For each task ti in S (in the scheduling order) do
                     - Assign a compute resource mj to task ti in order to minimize its finish time FT(ti, mj)
                     - Update MA(mj) and RA(rk), ∀ rk ∈ R(ti), based on FT(ti, mj)
                     - If (FT(ti, mj) > length) then length = FT(ti, mj)
11.          Add all tasks in S to ALLOCATED and remove them from W
12.          Let vx be the allocated task with the lowest finish time
13.          Remove vx from ALLOCATED
14.          Let C = W, where C is the set of candidate tasks that can be allocated next
15.          Remove all tasks from C that are incompatible with any allocated task
16.          If (C ≠ ∅)
16a.             Pick a critical node vc from C such that vc is adjacent to vx in g
16b.             Find a maximal independent set of tasks S from C such that vc ∈ S
17.      End (while)
18.  End (for)
End

Figure 3. The co-allocation algorithm
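The per-level loop of the co-allocation algorithm (steps 5 to 17) can be sketched for a single level of mutually independent tasks. This is an illustrative reading, not the authors' implementation. It assumes critical nodes are picked by highest average execution time (strategy S1), the maximal independent set is grown greedily, tasks are ordered by HAETF, a task may start only when its machine and all resources in R(ti) are free (using MA and RA), S is emptied when no candidate exists, and any candidate may serve as critical node when none is adjacent to vx. The resource sets are the hypothetical ones consistent with Section 3.4, and every task is assumed to take its Table 2 time on each of three identical machines.

```python
def co_allocate_level(tasks, need, exec_time, machines):
    """Return (SCHEDULE as a list of (task, machine, finish), length) for one level."""
    adj = {t: {u for u in tasks if u != t and need[t] & need[u]} for t in tasks}
    avg = {t: sum(exec_time[t, m] for m in machines) / len(machines) for t in tasks}
    ma = {m: 0.0 for m in machines}                    # MA(mj)
    ra = {r: 0.0 for t in tasks for r in need[t]}      # RA(rk)
    finish, schedule = {}, []
    waiting, allocated = set(tasks), set()

    def pick_critical(cands):      # S1: highest average execution time
        return max(sorted(cands), key=lambda t: avg[t])

    def grow(cands, seed):         # greedy maximal independent set containing seed
        chosen = [seed]
        for t in sorted(cands - {seed}, key=lambda t: (-avg[t], t)):
            if not adj[t] & set(chosen):
                chosen.append(t)
        return chosen

    s = grow(waiting, pick_critical(waiting))                    # steps 7-8
    while waiting:                                               # step 9
        for t in sorted(s, key=lambda t: (-avg[t], t)):          # step 10a (HAETF)
            ready = max((ra[r] for r in need[t]), default=0.0)
            m = min(machines, key=lambda mj: max(ma[mj], ready) + exec_time[t, mj])
            finish[t] = max(ma[m], ready) + exec_time[t, m]      # step 10b
            ma[m] = finish[t]
            for r in need[t]:
                ra[r] = finish[t]
            schedule.append((t, m, finish[t]))
        allocated |= set(s)                                      # step 11
        waiting -= set(s)
        if not waiting:
            break
        vx = min(sorted(allocated), key=lambda t: finish[t])     # step 12
        allocated.remove(vx)                                     # step 13
        cands = {t for t in waiting if not adj[t] & allocated}   # steps 14-15
        s = []
        if cands:                                                # step 16
            s = grow(cands, pick_critical((cands & adj[vx]) or cands))
    return schedule, max(finish.values())


NEED = {"V1": {"r1", "r2"}, "V2": {"r2", "r3"}, "V3": {"r3", "r5"},
        "V4": {"r1", "r4"}, "V5": {"r4", "r5", "r6"}, "V6": {"r6"}}
TIMES = {"V1": 5, "V2": 6, "V3": 2, "V4": 4, "V5": 1, "V6": 3}
MACHINES = ["m1", "m2", "m3"]
EXEC = {(t, m): TIMES[t] for t in TIMES for m in MACHINES}
SCHEDULE, LENGTH = co_allocate_level(sorted(TIMES), NEED, EXEC, MACHINES)
```

On this instance the sketch starts with {V2, V4, V6} and reaches a schedule length of 11 time units.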

4.2. Maximal Independent Set Selection

Since the maximum independent set problem is
NP-complete [15], we use a heuristic approach for selecting
maximal independent sets. Our approach is based on first
selecting a critical node vc, and then finding a maximal
independent set that contains vc. Critical nodes need to be
selected carefully.

The length of the schedule is influenced by the selection
of maximal independent sets and by the order in which these
sets are considered for scheduling. This is shown in the
following example using the compatibility graph in Figure 2.
For this example, for the sake of simplicity, we assume that
all the resource requirements of tasks (compute and
non-compute resources) are pre-specified. Therefore, the
execution times of all tasks are known a priori. These times are
shown in Table 2. Example schedules are given in Figures 4-6.
These schedules have different schedule lengths. The
optimal length of the schedule is 11 time units. This is achieved
by schedules 2 and 3. In schedules 1 and 2, two different
maximum independent sets were selected to be scheduled
first. Schedule 1 has a length of 13 time units, while
schedule 2 has the optimal length. This clearly shows the
importance of the order in which the maximal independent sets are
considered for scheduling. From schedule 3, we can also see
that it is not always efficient to select a maximum
independent set to be scheduled first. Schedule 3, which starts with
a maximal independent set {V2,V5} (not a maximum
independent set), has the optimal length, while schedule 1, which
starts with a maximum set {V3,V4,V6}, has a non-optimal
length of 13 time units.
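The effect of the first set on the schedule length can be reproduced with a small event-driven simulation. The sketch below is an illustration under stated assumptions: the edge list is inferred from the maximal independent sets given in Section 3.4, each task runs for its Table 2 time, and after the first set every waiting task is started greedily (in name order) as soon as no adjacent task is running. The greedy rule and the function name `schedule_length` are assumptions, and the text does not say which maximum independent set schedule 2 starts with, so {V1,V3,V6} is used as one possibility.

```python
EXEC = {"V1": 5, "V2": 6, "V3": 2, "V4": 4, "V5": 1, "V6": 3}   # Table 2
# Edges of the compatibility graph, inferred from the listed maximal independent sets.
EDGES = [("V1", "V2"), ("V1", "V4"), ("V2", "V3"),
         ("V3", "V5"), ("V4", "V5"), ("V5", "V6")]


def schedule_length(first_set):
    """Makespan when `first_set` (non-empty, independent) is started at time 0."""
    adj = {t: set() for t in EXEC}
    for a, b in EDGES:
        adj[a].add(b)
        adj[b].add(a)
    waiting = sorted(EXEC)
    running = {}                      # task -> finish time
    now = length = 0
    to_start = sorted(first_set)
    while waiting:
        for t in to_start:
            running[t] = now + EXEC[t]
            length = max(length, running[t])
            waiting.remove(t)
        if not waiting:
            break
        now = min(running.values())   # next allocation event
        running = {t: f for t, f in running.items() if f > now}
        to_start = []
        for t in waiting:             # greedy: start every task that is now compatible
            if not adj[t] & (set(running) | set(to_start)):
                to_start.append(t)
    return length


len_1 = schedule_length({"V3", "V4", "V6"})   # maximum set used by schedule 1
len_2 = schedule_length({"V1", "V3", "V6"})   # another maximum set
len_3 = schedule_length({"V2", "V5"})         # maximal but not maximum
```

Under these assumptions the simulation gives 13 time units when starting with {V3,V4,V6} and 11 time units for the other two starting sets, in line with the lengths reported for schedules 1 to 3.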

The  idea  behind  our  approach  for  selecting  a  maximal  in¬ 
dependent  set  S  is  to  select  a  critical  vertex  vc  and  add  it  to 
S  which  is  initially  empty.  Then  we  attempt  to  enlarge  S  by 
traversing  g.  Different  strategies  can  be  used  for  selecting 
critical  vertices.  In  the  following  we  describe  some  of  these 
strategies. 

S1 Highest average execution time. In this strategy, we
give priority to tasks that need more time for execution
since they can be critical tasks. In HC systems, tasks
have different execution times on different machines.
Therefore, we use the average execution time Exec(ti)
as the selection criterion.

S2 Highest degree. The node out-degree in a DAG has
been used in many list scheduling heuristics as a priority
function. The out-degree of a node ti gives the number
of tasks that have precedence constraints with ti.
The idea is to advance the execution of tasks with high
out-degree. Thus, many tasks can be ready for mapping
once high out-degree tasks complete execution. In our
framework, the out-degree of node ti in the combined
DAG G (which represents task ti) does not reflect all
dependencies between ti and other tasks since G only
captures the precedence constraints. Resource sharing
constraints should also be considered. Therefore, we
define the degree of task ti as the sum of its out-degree
in G and its degree in g. This number gives a better
indication of the number of tasks that can be ready
for mapping once ti completes its execution, because
those tasks have either precedence or resource sharing
dependencies with ti.

S3 Critical path nodes. A Critical Path (CP) in a DAG is a
path from an entry node to an exit node with the largest
completion time. We use average execution times and
average communication costs to find the critical path.
In some situations, the average execution time or the
degree of a task ti cannot reflect how important it is for
other tasks that ti finishes execution as soon as possible.
The successors of ti may not be critical tasks and
advancing their execution may not improve the schedule
length. For these reasons, selecting critical path
nodes from G as critical tasks can be a good strategy.
In this paper, we implement two variations of this
strategy:

S3.1 In this version, the task that is on the critical path
is selected as a critical task. If there is no such
task among the current set of candidate tasks, the
task with the highest average execution time is
selected as a critical task.

S3.2 This version is similar to S3.1 except for the case
when there is no critical path node among the current
set of candidate tasks. In this case, the task
with the highest degree is selected as a critical
task.

S4 Maximum weighted clique. In [3], a similar approach
to the compatibility graph has been used for scheduling
independent tasks. Each task in [3] requires
simultaneous access to a set of pre-specified processors.
All resource (processor) requirements and all execution
times were assumed to be known a priori. It has
been shown in [3] that the weight of the maximum
weighted clique in the constraint graph (compatibility
graph) is a lower bound on the optimum makespan,
where the weight of each node is its execution time.
In our previous example, notice that the weight of the
maximum weighted clique (V1 and V2) is 11 and the
optimal schedule length is 11. Also notice that any
selected maximal independent set should contain a task
that belongs to the maximum weighted clique in
order to achieve the optimal schedule length. Inspired
by this observation, we can use nodes in the
maximum weighted clique as candidates for selecting
critical tasks. In our approach, we use the average
execution times as the node weights. In our model, the
maximum weighted clique cannot guarantee the
optimal solution, but it could be a good heuristic for
selecting maximal independent sets.
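Two of the selection criteria above lend themselves to a short illustration: the combined degree used by S2 and the maximum weighted clique used by S4. The sketch uses the Table 2 execution times as node weights and an edge list inferred from the maximal independent sets of Section 3.4. The clique search is brute force and only meant for very small graphs, and the function names are hypothetical.

```python
from itertools import combinations

WEIGHTS = {"V1": 5, "V2": 6, "V3": 2, "V4": 4, "V5": 1, "V6": 3}   # Table 2
# Compatibility graph edges, inferred from the listed maximal independent sets.
EDGES = [("V1", "V2"), ("V1", "V4"), ("V2", "V3"),
         ("V3", "V5"), ("V4", "V5"), ("V5", "V6")]


def adjacency(vertices, edges):
    adj = {v: set() for v in vertices}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def combined_degree(task, successors, adj):
    """S2: out-degree of the task in the combined DAG G plus its degree in g."""
    return len(successors.get(task, ())) + len(adj[task])


def max_weighted_clique(weights, adj):
    """S4: brute-force search for the clique with the largest total weight."""
    best, best_weight = (), 0
    for size in range(1, len(weights) + 1):
        for cand in combinations(sorted(weights), size):
            if all(b in adj[a] for a, b in combinations(cand, 2)):
                w = sum(weights[v] for v in cand)
                if w > best_weight:
                    best, best_weight = cand, w
    return best, best_weight


ADJ = adjacency(WEIGHTS, EDGES)
clique, bound = max_weighted_clique(WEIGHTS, ADJ)
# The six tasks of the example are independent, so they have no successors in G.
degrees = {t: combined_degree(t, {}, ADJ) for t in WEIGHTS}
```

For this graph the heaviest clique is (V1, V2) with weight 11, which equals the optimal schedule length of the example, and V5 has the highest combined degree.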

4.3.  Allocation  Heuristics 

After selecting a maximal independent set of tasks,
careful allocation of these tasks to compute resources
(machines) is required to achieve our objective. Different
heuristics can be used for allocating tasks of the selected
maximal independent set S to compute resources. In the
following, we describe some of our allocation heuristics. The
idea  behind  our  heuristics  is  to  advance  the  execution  of 
tasks  that  may  be  critical  in  order  to  minimize  the  overall 
schedule  length. 

1.  Highest  Average-Execution-Time  First  (HAETF). 

In  this  heuristic,  the  average  execution  time  is  used  as 
a  priority  function  to  place  tasks  in  a  list.  All  tasks  are 
placed  in  a  list  in  the  order  of  non-increasing  average 
execution  times.  Using  this  order,  each  task  is  allocated 
to  the  required  resources  such  that  its  finish  time  is  min¬ 
imized. 

2.  Maximum  Finish-Time  First  (MAX).  For  each  task, 
we  calculate  the  best  finish  time  that  can  be  achieved. 
Then  we  select  the  task  with  the  maximum  best  finish 
time  among  all  tasks.  The  task  is  allocated  its  required 
resources  such  that  its  finish  time  is  minimized.  We  re¬ 
peat  until  all  tasks  are  allocated. 

3.  Minimum  Finish-Time  First  (MIN).  This  heuristic 
is  similar  to  the  Maximum  Finish-Time  First  (MAX) 
heuristic  except  that  we  select  the  task  with  the 
minimum  finish  time  instead  of  selecting  the  task  with 
the  maximum  finish  time. 

4.  Highest  Degree  First  (HDF).  In  this  heuristic,  all  tasks 
are  placed  in  a  list  in  descending  order  according  to 
their  degrees  (ties  are  broken  arbitrarily).  Then,  tasks 
are  allocated  one-by-one  to  the  required  resources  such 
that  the  finish  time  for  each  task  is  minimized. 
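The MAX and MIN heuristics differ only in which task is committed next, as the following sketch shows. It is a simplified illustration: only machine availability MA(mj) is tracked, all machines are assumed free at time 0, and the non-compute resources are ignored because the tasks of one maximal independent set do not share them. The function names and the execution times are made up.

```python
def best_finish(task, machines, exec_time, ma):
    """Best achievable finish time of `task` and the machine that achieves it."""
    m = min(machines, key=lambda mj: ma[mj] + exec_time[task, mj])
    return ma[m] + exec_time[task, m], m


def allocate_by_finish_time(tasks, machines, exec_time, pick_max):
    """MAX heuristic if pick_max is True, MIN heuristic otherwise."""
    ma = {m: 0.0 for m in machines}
    remaining = list(tasks)
    order = []
    choose = max if pick_max else min
    while remaining:
        # Recompute the best finish time of every unallocated task.
        scored = [(best_finish(t, machines, exec_time, ma), t) for t in remaining]
        (ft, m), t = choose(scored, key=lambda item: item[0][0])
        ma[m] = ft                      # the chosen machine is busy until ft
        remaining.remove(t)
        order.append((t, m, ft))
    return order


# Hypothetical execution times Exec(ti, mj) for two tasks on two machines.
EXEC = {("a", "m1"): 4, ("a", "m2"): 6, ("b", "m1"): 3, ("b", "m2"): 2}
max_order = allocate_by_finish_time(["a", "b"], ["m1", "m2"], EXEC, pick_max=True)
min_order = allocate_by_finish_time(["a", "b"], ["m1", "m2"], EXEC, pick_max=False)
```

In this tiny case both heuristics reach the same mapping and differ only in the order in which tasks are committed; on larger sets the order changes which machines remain available for later tasks.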

4.4.  Two-Phase  Algorithm 

We propose a two-phase algorithm for run-time
adaptation using our static co-allocation algorithm. The two-phase
algorithm can be used for the problem of mapping with
resource co-allocation as defined in our framework as follows.

Phase  1:  Compile-time  mapping.  At  this  phase,  the  co¬ 
allocation  algorithm  described  in  Section  4.1  is  used  to 
obtain  an  ordered  list  of  tasks.  The  order  of  tasks  in 
the  list  is  based  on  their  scheduling  order  as  produced 
by  our  co-allocation  algorithm.  The  list  is  obtained 
by  satisfying  all  precedence  and  resource  sharing  con¬ 
straints  with  the  objective  of  minimizing  the  overall 
schedule  length.  Estimated  computation  and  commu¬ 
nication  times  are  used  to  calculate  the  schedule  length. 

Phase 2: Run-time adaptation. Run-time adaptation is
useful when actual execution times differ from the
estimated execution times. One way to handle this is to
scan the ordered list obtained in phase 1 each time a task
completes its execution, find all tasks that can be
executed at that time, and reorder them locally. The scan
can be limited to a window of k tasks, where k > 0.
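The window scan of phase 2 might look like the following Python sketch. It assumes, beyond what the paper states, that a task "can be executed" when all of its predecessors have completed and all of its required resources are currently free; the function and argument names are hypothetical.

```python
def ready_in_window(ordered, done, running, preds, needs, free, k):
    """Return the tasks among the first k unstarted tasks that can run now.

    ordered: task list produced by phase 1 (compile-time mapping)
    done, running: sets of completed and currently executing tasks
    preds: task -> set of predecessor tasks
    needs: task -> set of required resources
    free: set of resources that are available right now
    k: window size, k > 0
    """
    # The window covers the first k tasks that have not started yet.
    window = [t for t in ordered if t not in done and t not in running][:k]
    return [
        t for t in window
        if preds.get(t, set()) <= done and needs.get(t, set()) <= free
    ]
```

A larger k lets the scan look further down the list, so a task that is ready but sits behind blocked tasks can be started earlier.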

4.5.  Implementation  Issues 

The focus of this paper is the mapping problem with
resource co-allocation requirements in HC systems. The
implementation details of the co-allocation process are
outside the scope of this paper. A good discussion of
implementation issues can be found in [5]. In the following,
for the sake of completeness, we briefly state our assumptions
regarding the co-allocation implementation.

We  assume  that  a  task  ti  cannot  start  execution  until  all 
its  required  resources  are  available.  These  resources  will  be 
acquired  at  the  same  time.  Once  a  task  ti  completes  its  ex¬ 
ecution,  all  its  allocated  resources  will  be  released  and  will 
be  available  for  other  tasks.  We  assume  that  any  allocation 
request  for  any  resource  will  be  granted  as  long  as  this  re¬ 
source  is  available.  In  this  paper,  we  do  not  consider  the 
cases  of  resource  failures  that  can  occur  in  the  HC  and  Grid 
environments. 

5.  Performance  Evaluation 

A  simulator  was  implemented  to  evaluate  the  perfor¬ 
mance  of  our  co-allocation  algorithms  and  the  proposed  se¬ 
lection  strategies  and  allocation  heuristics  discussed  in  Sec¬ 
tion  4.  In  this  section,  we  explain  our  simulation  procedure 
and  give  experimental  results. 

5.1.  Simulation  Procedure 

To define the HC system, the numbers of machines and
resources are given to the simulator as inputs. Communication
costs among all resources are selected randomly from a
uniform distribution with a mean equal to ave_comm. The
communication costs are source and destination dependent.

The workload consists of randomly generated DAGs.
Random DAGs are generated as follows. The number of
tasks in the graph, no_tasks, the maximum out-degree of
a node, max_outdegree, the average computation cost of a
node, ave_comp, and the average message size to be transferred
among tasks, ave_msgsize, are given as inputs. First, the
computation time of each task on every compute resource is
randomly selected from a uniform distribution with a mean
equal to ave_comp. Starting with the first task, the number
of children (out-degree) is randomly selected between 1 and
max_outdegree. Then, children are randomly selected for
this task. The weight of each edge in the DAG is randomly
selected from a uniform distribution with a mean equal to
ave_msgsize. Resource requirements for each task are
randomly selected from available resources. The amount of
data to be transferred to/from each resource in the resource
requirements set is randomly selected from a uniform
distribution with a mean equal to ave_datasize. The sizes of
random DAGs range from 50 to 250 tasks with increments of
50.

Figure 7. Performance of the allocation
heuristics when using selection strategy S1
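The random DAG generation procedure of Section 5.1 can be sketched as follows. This is a hedged illustration, not the authors' simulator: it assumes that "a uniform distribution with a mean equal to m" means U(0, 2m), and that children are drawn only from higher-numbered tasks so that the graph stays acyclic. Resource requirements and data sizes are omitted, and the function name `random_dag` is hypothetical.

```python
import random


def random_dag(no_tasks, max_outdegree, ave_comp, ave_msgsize, no_machines, seed=None):
    """Return (comp, edges): per-machine computation times and weighted edges."""
    rng = random.Random(seed)

    def draw(mean):
        # Assumption: "uniform with mean m" is taken as U(0, 2m).
        return rng.uniform(0.0, 2.0 * mean)

    # comp[t][m] is the computation time of task t on compute resource m.
    comp = [[draw(ave_comp) for _ in range(no_machines)] for _ in range(no_tasks)]

    edges = {}
    for u in range(no_tasks - 1):
        later = range(u + 1, no_tasks)  # candidate children (keeps the graph acyclic)
        outdeg = min(rng.randint(1, max_outdegree), len(later))
        for v in rng.sample(later, outdeg):
            edges[(u, v)] = draw(ave_msgsize)  # edge weight = message size
    return comp, edges
```

For example, `random_dag(50, 3, 50.0, 50.0, 4, seed=1)` produces one 50-task graph; repeating the call with different seeds gives the independent DAGs over which results are averaged.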


5.2.  Baseline  Algorithm 


Many mapping algorithms exist in the literature for
mapping DAGs in HC systems. None of these algorithms
considers the co-allocation problem we define in this paper.
Therefore, we use a simple list scheduling algorithm
as a baseline to evaluate our co-allocation algorithm.
The baseline algorithm is a fast static algorithm for
mapping DAGs in HC environments. It partitions the tasks
in the DAG into levels using an algorithm similar to the level
partitioning algorithm described in Section 4.1. Then all the
tasks are ordered such that the tasks in level k come before
the tasks in level k+1. The tasks in the same level are sorted
in descending order based on the average execution time of
each task (ties are broken arbitrarily). The tasks are
considered for mapping in this order. A task is mapped to the
required resources such that its finish time is minimized.
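The task ordering used by the baseline algorithm can be illustrated with a short Python sketch. The level partitioning of Section 4.1 is not reproduced in this section, so the sketch assumes a common definition: a task with no predecessors is at level 0, and any other task is one level below its deepest predecessor. The name `baseline_order` is hypothetical.

```python
def baseline_order(preds, avg_exec):
    """Order tasks level by level, longest average execution time first.

    preds: task -> set of predecessor tasks (missing key means no predecessors)
    avg_exec: task -> average execution time over all compute resources
    """
    level = {}

    def level_of(t):
        # Assumed rule: level = 1 + max level of predecessors, 0 for entry tasks.
        if t not in level:
            level[t] = 1 + max((level_of(p) for p in preds.get(t, ())), default=-1)
        return level[t]

    # Sort by level first, then by descending average execution time.
    return sorted(avg_exec, key=lambda t: (level_of(t), -avg_exec[t]))
```

Each task in the returned list would then be mapped, in order, to the resources that minimize its finish time.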

The Baseline algorithm is similar to our algorithms in
the sense that all algorithms proceed level-by-level. In the
Baseline algorithm, the scheduling order of tasks at the same
level [...]

Figure 8. Performance of the allocation
heuristics when using selection strategy S2

Figure 9. Performance of the allocation
heuristics when using selection strategy S3.1

Figure 10. Performance of the allocation
heuristics when using selection strategy S3.2

Figure 11. Performance of the selection
strategies with HAETF allocation heuristic

Figure 12. Performance of the selection
strategies with MIN allocation heuristic

Figure 13. Performance of the selection
strategies with MAX allocation heuristic

Figure 14. Performance of the selection
strategies with HDF allocation heuristic


5.3. Experimental Results


Our experimental results are given in Figures 7-14. The
total number of tasks was varied from 50 to 250 with
increments of 50. Each point in the figures is an average of
400 runs with different random DAGs. Random DAGs were
generated with max_outdegree = {2, 3, 4, 5}, ave_comp = 50,
ave_msgsize = 50 Kbytes, and ave_datasize = 300 Kbytes.

Figures 7-10 show the performance results of our
allocation heuristics compared to the Baseline algorithm when
using different maximal independent set selection strategies.
The improvement of our heuristics over the Baseline
increases as the total number of tasks increases. This shows
the importance of considering co-allocation requirements in
mapping algorithms. In general, our allocation heuristics
perform about equally well.

The performance results of the maximal independent set
selection strategies when using different allocation heuristics
are given in Figures 11-14. As in the previous set of
results, the improvement over the Baseline algorithm increases as
the total number of tasks increases. Also, the selection
strategies perform about equally well.

In  our  simulation  study,  we  found  that  the  number  of  ma¬ 
chines  and  the  number  of  resources  did  not  have  a  signifi¬ 
cant  impact  on  the  performance  of  allocation  heuristics  and 
selection  strategies. 

6.  Conclusions  and  Future  Work 

This  paper  proposes  a  novel  framework  for  the  problem 
of  mapping  applications  with  resource  co-allocation  in  HC 
systems.  We  formulated  the  co-allocation  problem  and  de¬ 
veloped  several  algorithms  for  solving  this  problem  using  a 
graph  theoretic  approach.  Our  simulation  results  show  the 
importance  of  considering  the  co-allocation  requirements 
during  mapping  decisions. 

In  solving  our  co-allocation  problem,  we  need  to  find 
maximal  independent  sets  among  tasks  competing  for  sys¬ 
tem  resources.  Although  we  considered  many  heuristics, 
they  all  seem  to  perform  equally  well  indicating  that  a  sim¬ 
ple  heuristic  will  suffice  (even  though  one  can  create  patho¬ 
logical  examples  for  each  heuristic). 

In  our  future  work,  we  plan  to  expand  our  framework 
to  consider  concurrent  usage  of  multiple  compute  resources 
and  advance  resource  reservations.  With  advance  reserva¬ 
tion,  system  resources  can  be  reserved  in  advance  for  spe¬ 
cific  time  intervals.  Therefore,  resource  availability  must  be 
expressed  as  a  list  of  available  time  slots  and  mapping  algo¬ 
rithms  should  be  insertion-based  algorithms.  To  co-allocate 
a  set  of  resources  in  this  case,  efficient  algorithms  are  needed 
to  find  the  best  time  slot  when  all  resources  are  available 
for the required duration. In this paper, our algorithms are
non-insertion-based, since the earliest available time for a
resource r_i is the finish time of the last task assigned to r_i.

Abstract

Dynamic real-time systems face many resource
management problems. This paper addresses the following
problems: (1) dynamic resource allocation to provide
QoS objectives, (2) heterogeneous resources, and (3)
non-intrusive, accurate monitoring of QoS, resource
availability, and resource needs. It describes the
techniques that a resource manager (RM) uses to handle
these problems and to support QoS in dynamic distributed
real-time systems. The contributions of this paper are as
follows: unification of dynamic resource requirements
among heterogeneous hosts, control of resources in
heterogeneous environments, feasibility analysis, and
dynamic load balancing/sharing. Our heuristic allocation
scheme not only allows higher workloads than the random,
round-robin, and least-load schemes by 257%, 142%, and
36.4%, respectively, but also improves QoS over them by
38.6%, 28.5%, and 31.6%, respectively.


1.  Introduction 

This  paper  describes  techniques  for  managing  het¬ 
erogeneous  host  resources  to  support  QoS  of  dynamic 
distributed  real-time  systems.  Our  approach  is  based  on 
the  dynamic  path  paradigm.  A  path-based  real-time 
subsystem  (see  [1],  [2])  typically  consists  of  a  detec¬ 
tion  &  assessment  path,  an  action  initiation  path,  and 
an  action  guidance  path.  The  paths  interact  with  the 
environment  by  evaluating  streams  of  data  from  sen¬ 
sors,  and  by  causing  actuators  to  respond  (in  a  timely 
manner)  to  events  detected  during  evaluation  of  sensor 
data  streams. 

An overview of our approach for RM is shown in
Figure 1. The “s/w spec” is used to describe QoS
requirements. The “h/w spec” defines information
about the hosts and networks, such as speed, OS type,
the number of CPUs, benchmark rate, bandwidth, and
interconnected equipment. The “QoS managers” collect
QoS metrics, compare them to the s/w spec, and request
resources if QoS violations occur. The “resource manager”
is the brain, which makes allocation decisions to
achieve QoS objectives.

This paper focuses on the resource management
component, and discusses a new technique for dynamic
feasibility analysis on heterogeneous resource platforms.
Most previous work in distributed real-time
systems assumed that all system behaviors follow a
statically known pattern (see [3][4]). When applying
the previous work to some applications (such as
shipboard AAW systems [1][5]), problems arise with
respect to scalability of analysis and modeling
techniques; furthermore, it is sometimes impossible to
obtain some of the parameters required by the models.

Table 1. System resource model

SYMBOL                      NOTATION
a_ij                        name of application j in path i
tl = tl(c, P_i)             workload or tactical load at cycle c in path i
CUP_obs(a_ij, tl, H_k)      the CPU user-percentage of host H_k for the
                            application a_ij in path i at workload tl
CUP_uni(a_ij, tl, H_k)      the unified CPU user-percentage of host H_k for
                            the application a_ij in path i at workload tl
CUP(H_j, t)                 the CPU user-percentage of host H_j at time t
CIP(H_j, t)                 the idle-percentage of host H_j at time t
MEM_obs(a_ij, tl, H_k)      the memory usage of application a_ij in path i
                            on host H_k at workload tl
FAM(H_j, t)                 the free-available-memory of host H_j at time t
X_obs(a_ij, tl, H_k)        the execution time of application a_ij in path i
                            on host H_k at workload tl
P_obs(a_ij, tl, H_k)        the period of application a_ij in path i on
                            host H_k at workload tl
CCR(H_j)                    CPU clock rate in MHz at host H_j
SPECint95(H_j)              the fixed point operation performance of
                            SPEC CPU95 of host H_j
SPECfp95(H_j)               the floating point operation performance of
                            SPEC CPU95 of host H_j
SPEC_RATE(H_j)              the relative SPEC CPU95 rating of host H_j
Threshold_CPU(H_j)          the CPU threshold
Threshold_MEM(H_j)          the memory threshold

In contrast, the DeSiDeRaTa RM (see [6][2]) allows the
modeling of systems that work in environments that
have unknown scenarios (such as battle environments)
(see [7]); the dynamic path paradigm is based on
obtainable parameters, since it evolved from the study of
existing computer systems; and the large granularity of
the path makes it more scalable than task allocation
approaches.

The  new  contributions  of  this  paper  are  as  follows: 
(1)  unification  of  dynamic  resource  requirements 
among  heterogeneous  hosts,  (2)  control  of  resources  in 
heterogeneous  environments,  (3)  feasibility  analysis, 
and  (4)  dynamic  load  balancing/sharing. 

Section 2 presents the system model, the feasibility
analysis, and the laxity-based RM approach. Section 3
presents the results of experiments. Finally, Section 4
gives the summary and conclusion.

2.  Laxity  Based  RM 


In this section, the resource management approach is
explained. The basic steps of dynamic resource
management are as follows: (step 1) Resource Requirement,
(step 2) Resource Discovery, (step 3) Resource Unification,
(step 4) Feasibility Analysis, and (step 5) Optimization.
These steps are explained in detail in the
remainder of this section. First, a mathematical model,
which is used in the detailed explanation, is presented.

Table 1 shows the system resource model. a_ij and tl
represent an application and the workload of an application,
respectively. Indices starting with CUP stand for CPU
usage. CIP is the CPU idle percentage of a host. MEM and
FAM relate to memory usage. X and P are the execution
time and period of an application a_ij, respectively.
CCR stands for the CPU clock rate of a host. The SPEC
CPU95 host benchmark consists of SPECint95 and
SPECfp95, which show the relative performance of fixed and
floating point operations in a system. SPEC_RATE is the
overall relative system rank. The Threshold indices are a
certain amount of resource set aside to tolerate different
amounts of resource requirement.

The steps taken by RM are now explained in detail.
The resource requirement step works as follows. The QoS
manager (QM) detects QoS violations by monitoring the QoS
of a path and of each application, and requests additional
resources based on its decisions. When a significant amount
of workload is observed, QM analyzes the latency of each
application. If a_ij uses more resources than others, or
the latency of a_ij is higher than the minimum QoS slack,
then QM triggers a request for additional resources with
another copy of the application. This is called the
“scale-up” decision. When the workload has not changed but
a QoS violation occurs, QM triggers migration of the a_ij
running on the overloaded host. This is called the “move”
decision.

Therefore, different resource requirements should be
measured according to the decision. For the “move”
decision, RM measures the dynamic CPU resource requirement
of the violated application using CUP_obs(a_ij, tl, H_k) =
X_obs(a_ij, tl, H_k) / P_obs(a_ij, tl, H_k). For the
“scale-up” decision, the resource requirement is measured by
interpolation and extrapolation from the initial profile for
the new workload: tl = current tl / (replicas + 1).

The resource discovery step is explained here in detail.

Monitoring resource availability is more difficult in
dynamic environments than in static environments,
because of unknown system activation.
FAM(H_k, t), CIP(H_k, t), and CUP(H_k, t) are collected for
every host k once per second. These measurements are
filtered by an exponential moving average (EMA), as
illustrated below for CUP:

\[ \mathrm{EMA}(\mathrm{CUP}(H_j, t)) = (1 - \beta)\,\mathrm{CUP}(H_j, t) + \beta\,\mathrm{EMA}(\mathrm{CUP}(H_j, t-1)), \quad t > 1,\ \beta = e^{-T}. \]
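The EMA filter can be written as a short Python function. This is a minimal sketch: the smoothing factor `beta` is passed in directly rather than derived from T, and seeding the filter with the first raw sample is an assumption, since the source only defines the recurrence for t > 1. The name `ema_filter` is hypothetical.

```python
def ema_filter(samples, beta):
    """Apply EMA(t) = (1 - beta) * x(t) + beta * EMA(t - 1) to a sample series."""
    filtered = []
    for x in samples:
        # Assumption: the first filtered value equals the first raw sample.
        prev = filtered[-1] if filtered else x
        filtered.append((1.0 - beta) * x + beta * prev)
    return filtered
```

With `beta = 0.5`, the once-per-second CUP samples `[10, 20, 20]` are smoothed to `[10.0, 15.0, 17.5]`; a larger `beta` gives more weight to history and reacts more slowly to load spikes.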

Resources have different scales and capacities across
heterogeneous platforms, even when they are expressed in the
same unit. The resource unification step is now explained in
detail.

Definition: Resource unification produces a canonical
form of each resource metric.

RM can allocate and control resources accurately if
each resource is unified. Consider CUP_obs(a_ij, tl, H_k) as
the resource requirement. To allocate that amount of the
resource, RM needs to analyze the requirement and
map it to target hosts. There are two approaches, static
and dynamic. The static approach uses stable system
information such as benchmarks or CPU clock rate. It
determines the relative amount of system resources
efficiently, but inaccurately for dynamic environments. The
dynamic approach of predicting execution time using
dynamic system information has high complexity for
real-time systems, as an application uses several different
resources such as disk I/O, memory, and CPU, each
of which performs differently among hosts.
Therefore, a static approach is selected as follows. For
the unification of resources, the results of the varied,
realistic SPEC CPU95 programs give valuable insight into
expected real performance among heterogeneous hosts.

However, no single benchmark can fully characterize
overall system performance. SPEC CPU95 measures
the performance of the CPU, the memory system, and
compiler code generation by running 18 programs that are
designed to measure throughput. The geometric mean is
used to represent overall system performance
compared to a reference machine, a Sun SPARC-10/40MHz.
This standardized set of benchmarks (SPECint95
and SPECfp95) adapts efficiently to the recent generation
of high-performance computing (HPC) systems [8]. Hence,
the following formula (1) is used to unify the CPU
resource, CUP_uni(a_ij, tl, H_T), onto the target host H_T
from CUP_obs(a_ij, tl, H_k) on the source host H_k.

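\[ \mathrm{CUP}_{uni}(a_{ij}, tl, H_T) = \mathrm{CUP}_{obs}(a_{ij}, tl, H_k) \cdot \frac{\mathrm{SPEC\_RATE}(H_k)}{\mathrm{SPEC\_RATE}(H_T)} \tag{1} \]

where, for all i,

\[ \mathrm{SPEC\_RATE}(H_i) = \mathrm{AVG}\left( \frac{\mathrm{SPECint95}(H_i)}{\max_j \mathrm{SPECint95}(H_j)},\ \frac{\mathrm{SPECfp95}(H_i)}{\max_j \mathrm{SPECfp95}(H_j)} \right). \]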
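Formula (1) translates directly into code. The Python sketch below computes SPEC_RATE for a set of hosts and then rescales an observed CPU percentage from a source host to a target host. The function names are hypothetical, and any benchmark numbers passed in an example are invented for illustration and are not real SPEC CPU95 scores.

```python
def spec_rate(hosts):
    """hosts: name -> (SPECint95, SPECfp95). Returns name -> SPEC_RATE in (0, 1]."""
    max_int = max(i for i, _ in hosts.values())
    max_fp = max(f for _, f in hosts.values())
    # Average of the integer and floating point scores, each normalized
    # by the best score among all hosts.
    return {h: (i / max_int + f / max_fp) / 2.0 for h, (i, f) in hosts.items()}


def unify_cup(cup_obs, source, target, rate):
    """Formula (1): CPU percentage the application would need on the target host."""
    return cup_obs * rate[source] / rate[target]
```

For two made-up hosts `{"slow": (5.0, 4.0), "fast": (20.0, 16.0)}`, the rates are 0.25 and 1.0, so an application observed at 40% CPU on `slow` is estimated to need 10% on `fast`.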

Another piece of static system information, CCR(H_j),
was considered, but it is not applicable to the unification
of resources, because RISC and CISC processors use
different numbers of CPU cycles and because different
VLSI technologies are used. For example, a Sun Ultra-1 at
167 MHz has better performance than a SPARC-5 at
170 MHz.

Now,  the  feasibility  analysis  steps  are  illustrated  as 
follows.  The  best-host  approach  (see  [9])  without  con¬ 
sideration  of  resource  availability  does  not  guarantee 
load  balance.  Therefore,  this  step  distinguishes  feasible 
hosts  in  terms  of  resource  availability  based  on  the 
unified  resource.  Furthermore,  in  formula  (2),  the 
thresholds  for  the  load  balancing  process  include  CPU 
idle  time  and  available  memory;  the  current  CPU  and 
memory  usage  of  the  process  that  is  to  be  migrated  are 
compared  against  the  thresholds  to  determine  the  desti¬ 
nation  host.  If  a  host  satisfies  the  condition  of  feasibil¬ 
ity  analysis  in  formula  (2)  and  no  faults  are  detected  on 
the  host,  then  it  is  a  candidate  host. 

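\[ \begin{aligned} &\big(\mathrm{Feasible}_{CPU}(H_i,t) = \mathrm{CIP}(H_i,t) - \mathrm{CUP}_{uni}(a_{ij}, tl, H_i)\big) > \mathrm{Threshold\_CPU}(H_i) \ \ \& \\ &\big(\mathrm{Feasible}_{MEM}(H_i,t) = \mathrm{FAM}(H_i,t) - \mathrm{MEM}_{obs}(a_{ij}, tl, H_i)\big) > \mathrm{Threshold\_MEM}(H_i) \end{aligned} \tag{2} \]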
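The feasibility test of formula (2), combined with the "no faults detected" condition from the text, can be sketched as a simple filter over hosts. The dictionary field names (`cip`, `fam`, `thr_cpu`, `thr_mem`, `alive`) and the function name are hypothetical; the unified CPU demand per host is assumed to have been computed beforehand with formula (1).

```python
def candidate_hosts(hosts, cup_uni, mem_obs):
    """Return the hosts that satisfy formula (2) and have no detected fault.

    hosts: name -> dict with 'cip', 'fam', 'thr_cpu', 'thr_mem', 'alive'
    cup_uni: name -> unified CPU demand of the application on that host
    mem_obs: observed memory demand of the application
    """
    result = []
    for name, h in hosts.items():
        feasible_cpu = h["cip"] - cup_uni[name]  # idle CPU left after allocation
        feasible_mem = h["fam"] - mem_obs        # free memory left after allocation
        if feasible_cpu > h["thr_cpu"] and feasible_mem > h["thr_mem"] and h["alive"]:
            result.append(name)
    return result
```

A host is kept only when both the CPU margin and the memory margin stay above their thresholds, which is what prevents the "best host" from being overloaded by repeated allocations.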

Finally,  the  optimization  step  is  explained.  Opti¬ 
mized  resources  give  good  information  to  RM  for  effi¬ 
cient  allocations. 

Definition: Laxity is an available amount of unified
resources after allocation of requested resources
delivered from QM for the violated applications.

1.  QM requests resources CUP_obs(a_ij, tl, H_L), MEM_obs(a_ij, tl, H_L)
2.  Get the host list, HL, including host load indices, load metrics (LM)
3.  Calculate EMA of LM
4.  No_of_Candidate_Host = 0 ;
5.  Create Linked List of HL_CPU ;
6.  Create Linked List of HL_MEM ;
7.  For (k = first(HL(H_i, t)) ; k <= last(HL(H_i, t)))
8.    CUP_uni(a_ij, tl, H_k) = CUP_obs(a_ij, tl, H_L) * SPEC_RATE(H_L) / SPEC_RATE(H_k) ;
9.    Feasible_CPU(H_k, t) = CIP(H_k, t) - CUP_uni(a_ij, tl, H_k) - Threshold_CPU(H_k) ;
10.   Feasible_MEM(H_k, t) = FAM(H_k, t) - MEM_obs(a_ij, tl, H_k) - Threshold_MEM(H_k) ;
11.   If (Feasible_CPU(H_k, t) > 0) && (Feasible_MEM(H_k, t) > 0)
12.     ℓ_CPU(H_k, t) = Feasible_CPU(H_k, t) * SPEC_RATE(H_k) ;
13.     ℓ_MEM(H_k, t) = Feasible_MEM(H_k, t) ;
14.     Append H_k and ℓ_CPU(H_k, t) to HL_CPU ;
15.     Append H_k and ℓ_MEM(H_k, t) to HL_MEM ;
16.     No_of_Candidate_Host ++ ;
17.   //end if 11
18. Loop 7 ;
19. Sort HL_CPU in descending order of ℓ_CPU(H_i, t)
20. Sort HL_MEM in descending order of ℓ_MEM(H_i, t)
21. If (No_of_Candidate_Host = 0) Return(Target_Host = first(HL_CPU))
22. Target_Host = first(HL_CPU) ;
23. While(true)
24.   If ((Target_Host is Alive) && (Target_Host is in top 50th percentile of HL_MEM))
25.     Return (Target_Host) ;
26.   Else Target_Host = next(HL_CPU) ;
27. Loop 23 ;


Figure 2. Resource allocation algorithm


Feasible_CPU(H_i, t) is the available amount of resources
after allocation of a_ij. Unifying Feasible_CPU(H_i, t)
gives the optimized resource availability. This
optimization is an important QoS factor. Formulas (3) and (4)
show the laxity of CPU, ℓ_CPU(H_i, t), and the laxity of
memory, ℓ_MEM(H_i, t).

\[ \ell_{CPU}(H_i,t) = \mathrm{Feasible}_{CPU}(H_i,t) \cdot \mathrm{SPEC\_RATE}(H_i) \tag{3} \]
\[ \ell_{MEM}(H_i,t) = \mathrm{Feasible}_{MEM}(H_i,t) \tag{4} \]

Based on the optimized resources, two resource allocation
schemes are considered: max-laxity host (A_Max), shown in
formula (5), and min-laxity host (A_min), shown in
formula (6). Other approaches, namely random (ra),
round-robin (rr), and least-load (ll), have been
tested and compared with our laxity-based allocation
schemes. The least-load approach shown in formula (7), in
which resources are not unified, does not guarantee the QoS
requirement, because the available resources in the supply
space do not correspond to the resource requirement in the
demand space.


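\[ A_{Max} = \max_i\big(\ell_{CPU}(H_i,t)\big)\ \text{and}\ \mathrm{Top}_i\big(\ell_{MEM}(H_i,t), 50\big) \tag{5} \]
\[ A_{min} = \min_i\big(\ell_{CPU}(H_i,t)\big)\ \text{and}\ \mathrm{Bot}_i\big(\ell_{MEM}(H_i,t), 50\big) \tag{6} \]
\[ ll = \max_i\big(\mathrm{CIP}(H_i,t) \cdot W_{cpu} + \mathrm{FAM}(H_i,t) \cdot W_{mem}\big) \tag{7} \]

where Top_i(ℓ_MEM(H_i, t), 50) means that host i is in the top 50th
percentile in laxity of memory,
Bot_i(ℓ_MEM(H_i, t), 50) means that host i is in the bottom 50th
percentile in laxity of memory, and

\[ W_{cpu} + W_{mem} = 1, \quad \forall i. \]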
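For comparison, the least-load rule of formula (7) can be sketched in a few lines. The sketch assumes that CIP and FAM are both expressed on comparable scales (for example, as percentages), which the source does not specify; the field names and the default weight are hypothetical.

```python
def least_load_host(hosts, w_cpu=0.5):
    """Formula (7): pick the host with the largest weighted idle CPU and free memory.

    hosts: name -> dict with 'cip' (CPU idle %) and 'fam' (free memory)
    w_cpu: weight of the CPU term; the memory weight is 1 - w_cpu
    """
    w_mem = 1.0 - w_cpu
    # No SPEC_RATE scaling here: raw availability is compared across hosts.
    return max(hosts, key=lambda h: hosts[h]["cip"] * w_cpu + hosts[h]["fam"] * w_mem)
```

Because the raw values are compared without unification, an idle but slow host can win over a busier but much faster one, which is the weakness the text attributes to the least-load approach.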

In our approach, other resource requirements, such as
network bandwidth and disk I/O, can be handled in a similar
way. The final decision is made based on the laxity of
each resource using a heuristic algorithm: find the host that
has the maximum ℓ_CPU; if that host is in the top 50th
percentile of the host list sorted by ℓ_MEM(H_i, t), select
it; if not, examine the host with the next highest
ℓ_CPU(H_i, t). Figure 2 explains the resource allocation
algorithm in detail.
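The max-laxity selection of Figure 2 can be condensed into the following Python sketch. It is an illustration under stated assumptions rather than the authors' implementation: "top 50th percentile" is taken to mean the top half of the memory-laxity list rounded up, the function returns `None` when no feasible host qualifies (Figure 2 does not define that case cleanly), and all field and function names are hypothetical.

```python
def select_target(request, hosts, rate):
    """Return the max-laxity target host, or None if no host qualifies.

    request: dict with 'cup_obs', 'mem_obs', and 'source' (source host name)
    hosts: name -> dict with 'cip', 'fam', 'thr_cpu', 'thr_mem', 'alive'
    rate: name -> SPEC_RATE
    """
    lax_cpu, lax_mem = {}, {}
    for name, h in hosts.items():
        # Steps 8-10: unify the CPU demand, then subtract demand and threshold.
        cup_uni = request["cup_obs"] * rate[request["source"]] / rate[name]
        feas_cpu = h["cip"] - cup_uni - h["thr_cpu"]
        feas_mem = h["fam"] - request["mem_obs"] - h["thr_mem"]
        if feas_cpu > 0 and feas_mem > 0:
            lax_cpu[name] = feas_cpu * rate[name]  # formula (3)
            lax_mem[name] = feas_mem               # formula (4)

    by_cpu = sorted(lax_cpu, key=lax_cpu.get, reverse=True)
    by_mem = sorted(lax_mem, key=lax_mem.get, reverse=True)
    # Assumption: top 50th percentile = first half of the list, rounded up.
    top_half = set(by_mem[:(len(by_mem) + 1) // 2])

    for name in by_cpu:
        if hosts[name]["alive"] and name in top_half:
            return name
    return None  # assumption: signal "no suitable host" instead of looping forever
```

The CPU laxity is multiplied by SPEC_RATE so that leftover capacity on a fast host counts for more than the same percentage on a slow host, while the memory check keeps the choice away from hosts that are short on free memory.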

Instead of resource allocation, control of heterogeneous
resources is an efficient way to provide quick
resource management. Dynamic CPU proportion
change on Linux using the Quasar scheduler [10][11]
and priority handling on NT and Solaris are
implemented in our scheme.

Furthermore, for accurate allocation, the RM should
consider not only load balance based on resource
availability, but also a measure of system contention called
the slowdown factor. This is an ongoing study, especially in
the area of network load between two communication
nodes.

3.  Experiments 

We have used DynBench [12] as an assessment tool
for DeSiDeRaTa. It uses an identical scenario for all
experiments. The experimental system parameters and
heterogeneous environment are as follows: 1 Linux
Pentium 200 MHz, 1 NT Pentium-III 500 MHz, 2 NT
Pentium-II 400 MHz, 2 NT Pentium 200 MHz, 2 Solaris
SPARC-5 170 MHz, 2 Solaris Ultra-1 167 MHz, 1 SunOS
on Ultra-10 300 MHz, and 100 Mbps Fast Ethernet.

The  first  experiment  monitors  and  analyzes  resource 
requirements  corresponding  to  step  1  in  section  2.  The 
second  experiment  measures  the  unification  ap¬ 
proaches  corresponding  to  step  3  in  section  2.  The  third 
experiment  compares  different  allocation  schemes  cor¬ 
responding  to  step  5  in  section  2.  Experiment  details 
are  presented  in  the  remainder  of  this  section. 

Experiment 1, shown in Figure 3, measures the variance
of execution time with different monitoring methods and
different periods in (c), and the variance of memory
among hosts in (a) and (b). Three different monitoring
techniques are used: the getrusage() system call (GRU),
reading the process table (PT), and the ps command (PSU).
In experiment (c), the variance of execution time measured
by reading PT is high and depends on the monitor cycle
time, as the period for accessing PT cannot exactly cover
the range of process execution time. It is impossible to
collect the exact resource usage of a process at a
particular instant of time. However, the GRU system call
reports process resource usage accurately in terms of
variance of execution time. The exponential moving average
(EMA) of each method is used for filtering. The maximum
difference in memory usage by the evaluate and decide
application (ED) on two different hosts is 48 Kbytes (from
(a) and (b) in Figure 3). Hence, Threshold_MEM(H_k, t) and
Threshold_CPU(H_k, t) are necessary components to
constrain the candidate hosts. The measured variance of the
memory requirement of the applications is zero.

The second experiment, described in Figure 4, shows
the difference between observed resource usage and the
unified resource estimated by SPEC_RATE and by
Clock_Rate (CCR). For example, the execution time is
collected on a Pentium-200, and we multiply the measured
execution time by the SPEC_RATE (or CCR) factor of the
target host, a PentiumIII-500. Next, we run the
same scenario on the PentiumIII-500 to observe the actual
execution time of the process and compare it with the
previously estimated execution time. The difference between
the resource unified by CCR and the observed resource is 8%
on NT and 3.5% on Sun. The difference for the resource
unified by SPEC_RATE is 1% on NT and 11% on Sun.

Experiment 3 uses three measurements, QoS
violation rate (QVR), QoS sensitivity (QSS), and QoS,
to compare the QoS characteristics of different allocation
decision algorithms, as shown in Figure 5. QVR is
the number of violations within 2 minutes under
increasing workload. QSS is the amount of workload needed to
trigger a second violation after the first violation. QoS is
the latency of a path as improved by the first allocation.
This experiment shows clearly that the ll approach, which
ignores heterogeneity (proposed by Ravindran [9]), is
much worse than our scheme in terms of QVR, QSS,
and QoS. Our A_Max scheme is 26.4% better in
QoS, 36.4% better in QSS, and 60% lower in QVR than
the ll approach.

4.  Conclusion  and  Future  Study 

This paper presents 5 solutions of resource
allocation for dynamic real-time systems. Our A_Max scheme
not only allows higher workloads than ra, rr, and ll
by 257%, 142%, and 36.4%, respectively, but also
improves QoS over ra, rr, and ll by 38.6%, 28.5%, and
31.6%, respectively. Controlling heterogeneous
resources using CPU proportion change and priority
change is useful for server programs. The
efficiency of resource allocation in terms of QoS
objectives for scalable and movable clients is better than
that of control. Ongoing work includes finding a
specific resource management solution for hard
real-time applications, predictive RM, proactive RM,
and QoS negotiation. Also, heterogeneous network
resource monitoring and allocation, and the decision
mechanism between allocation and control, are
important issues in providing QoS requirements.

Figure 3. The dynamic measures of monitored resource requirement

Figure 4. Resource unification by SPEC_RATE and Clock_Rate

Figure 5. Comparison of resource management schemes

Abstract

The use of multicomputer clusters composed of cheap
workstations connected by high-speed networks is common
in modern high-performance computing. However, operating
system research in such environments has lagged. Our
research aims at enhancing the functionality of the operating
system by providing management functions that allow
dynamic resource sharing and performance prediction in a
clustered environment supporting distributed shared memory
and multithreading. Central to this approach is the
development of a parametric cost model that can predict
the performance ramifications of policy choices and allow
applications and middleware to adapt to the computing
environment and achieve better performance.


1  Introduction 

The  goal  of  our  research  is  to  develop  and  evaluate  a 
parametrical  cost/benefit  model  to  be  used  as  a  decision¬ 
making  tool  for  managing  dynamic  system  resource  sharing 
in an environment of high-performance heterogeneous clusters.
Multicomputer clusters are capable of achieving computational
rates equal to or higher than those of conventional
supercomputers. However, the performance these systems
deliver to applications is usually just a fraction of their maximum
capacity, even in cases of applications that can theoretically
achieve much higher computation rates. Software
inefficiency is to blame for this phenomenon. We are
developing  mechanisms  that  better  share  system  resources 
and  promote  parallelization  of  tasks.  These  mechanisms 
allow  applications  to  self-adapt  to  the  varying  availability 
of  system  resources  according  to  their  own  varying  resource 
requirements  and  thus  run  more  efficiently. 

The  key  idea  in  this  research  is  the  notion  of  cost,  in 
terms  of  execution  time,  and  the  ramifications  of  certain  op¬ 
erating  system  services  (including  process/thread  creation, 
process/thread placement, and inter-process communication).
This cost varies with several parameters such as application
requirements, application behavior, time, system
configuration,  and  network  topology.  Realistic  prediction  of 
the  cost  of  system  services  and  choices  gives  an  application 
the  ability  to  make  its  own  decisions  regarding  the  execution 
environment  that  best  suits  its  needs. 

We  are  therefore  developing  a  parametric  cost  model 
for  predicting  the  cost  of  certain  operating  system  services 
while  taking  into  account  system  characteristics  and  appli¬ 
cation  information,  as  well  as  a  set  of  software  extensions  to 
the  operating  system  to  support  the  function  of  this  model 
and  facilitate  the  dynamic  sharing  of  system  resources.  The 
result  will  be  an  operating  system  with  flexible  resource 
control,  able  to  deliver  a  higher  percentage  of  underlying 
system  performance  to  applications  and  middleware. 

Section  2  of  this  paper  presents  the  motivation  for  this 
work,  section  3  the  hardware  and  software  environment  un¬ 
der  consideration,  and  section  4  the  family  of  applications 
of  interest.  Section  5  describes  our  approach  in  more  detail, 
while  section  6  discusses  related  work.  Section  7  gives  a 
brief  summary. 

2  Motivation 

In  recent  years  there  has  been  a  new  trend  in  high- 
performance  computing:  the  use  of  multicomputers  built 
from  common,  off-the-shelf  components.  These  systems 
are  capable  of  sustaining  computation  rates  that  rival  or 
surpass  the  rates  sustained  by  conventional  supercomputers, 
but  the  demonstrated  performance  is  just  a  fraction  of  the 
maximum  theoretical  performance.  For  example,  the  Cen¬ 
turion  cluster  of  phase  I  achieved  3.7  sustained  gigaflops 
on  49  CPUs  during  a  run  of  an  ocean  modeling  applica¬ 
tion  [1].  This  code  gets  just  650  megaflops  on  a  single  Cray 
T90  CPU.  While  this  comparison  is  encouraging,  Centurion 
of  phase  I  had  a  maximum  capacity  of  over  60  gigaflops. 
While  it  would  be  naive  to  expect  to  sustain  a  rate  of  com¬ 
putation  approaching  the  maximum,  there  is  certainly  room 
to improve the performance substantially, for several categories
of applications. We identify software overhead as
one of the primary culprits in this reduced performance issue.
To overcome this overhead we have each software layer
expose  to  higher  layers  information  describing  the  behavior 
of  the  individual  layer  and  implications  that  behavior  has 
for  performance.  Also,  the  application  provides  information 
regarding  its  own  behavior  to  lower  software  layers  as  hints 
to  let  them  better  estimate  the  impact  on  performance. 

Our  approach  chooses  to  work  from  the  bottom  up,  be¬ 
cause  all  applications,  regardless  of  middleware  system,  use 
the  operating  system.  By  adding  a  set  of  services  to  the 
operating  system,  if  possible  only  as  user-level  modules  for 
reasons  of  portability,  we  aim  to  allow  dynamic  resource 
sharing  and  also  provide  a  resource  consumer  (not  neces¬ 
sarily  only  an  application)  with  information  about  the  cost 
of  a  specified  operation.  For  example,  runtime  systems  will 
be  able  to  get  information  to  make  service  guarantees  to 
applications,  and  applications  will  be  able  to  self-adapt  to 
various  system  configurations  and  varying  resource  avail¬ 
ability.  Load  balancing  environments  will  be  able  to  better 
estimate  system  efficiency  and  better  understand  the  effect 
various  task  mappings  have  on  performance.  In  these  envi¬ 
ronments  dynamic  granularity  control  will  also  be  possible. 
Finally,  smart  compilers  will  be  able  to  take  into  account 
operating  system  cost  information  during  the  compilation 
phase  of  an  application  to  optimize  the  executable  code  for 
specific  system  configurations  and  loads.  To  complement 
this,  it  will  also  be  possible  to  have  an  application  loader 
that  will  take  into  account  application  supplied  information 
and  create  the  most  suitable  environment  within  a  system’s 
limits  for  running  a  specific  application. 

3  The  Hardware  and  Software  Environment 

The  environment  in  which  the  problem  is  considered  is 
a  distributed  system  consisting  of  a  variety  of  nodes  con¬ 
nected  with  networks  of  various  speeds.  We  view  such  a 
system  as  having  a  physical  and  a  logical  organization. 

The  physical  organization  of  the  system  is  based  on  the 
principle  of  hierarchical  clustering  [13].  Groups  of  neigh¬ 
boring  nodes  form  local  clusters.  Local  clusters  can  be 
grouped  together  and  form  second  level  superclusters,  sec¬ 
ond  level  supercluster  groups  can  form  third  level  superclus¬ 
ters,  etc.  (see  figure  1).  The  main  criterion  for  forming  the 
various  hierarchy  echelons  is  the  cost  of  communications — 
these  echelons  reflect  the  underlying  networks.  That  is, 
echelon  0  corresponds  to  communications  between  proces¬ 
sors  of  the  same  node  (intra-node) — if  nodes  have  more 
than  one  processor — which  have  hardware  shared  memory 
and  thus  the  least  cost  for  sharing  information.  Echelon  1 
corresponds  to  intra-cluster  communication,  echelon  2  to 
cross-cluster  communication  between  clusters  of  the  same 
2nd level supercluster, echelon 3 to cross-cluster communication
between clusters of different 2nd level superclusters,
etc.  Another  way  to  think  of  this  organization  is  as  a  system 
map. 
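
The echelon numbering described above can be illustrated with a small Python sketch. It assumes, purely for illustration, that every processor is identified by a path of labels running from the outermost supercluster down to the processor itself; this addressing scheme and the function name `echelon` are hypothetical and not part of the system described here.

```python
def echelon(a, b):
    """Return the communication echelon between two processors.

    Each processor is described by a tuple of labels ordered from the
    outermost supercluster down to the processor itself (a hypothetical
    addressing scheme). Processors that differ only in the last label
    share a node (echelon 0); processors that first differ at the node
    label share a local cluster (echelon 1), and so on upward.
    """
    if len(a) != len(b):
        raise ValueError("addresses must have the same depth")
    if a == b:
        raise ValueError("addresses refer to the same processor")
    # Find the first level at which the two paths diverge.
    for level, (x, y) in enumerate(zip(a, b)):
        if x != y:
            # Diverging at the last label gives 0, one level up gives 1, ...
            return len(a) - 1 - level
```

With four-label addresses of the form (supercluster, cluster, node, processor), two processors of one node map to echelon 0, and two processors in clusters of different second level superclusters map to echelon 3.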


Figure 1. A clustered multicomputer - physical clustering

Over  the  physical  clusters  is  laid  an  organization  of  log¬ 
ical  clusters.  The  nodes  of  a  logical  cluster  share  infor¬ 
mation  using  distributed  shared  memory.  Note  the  differ¬ 
ence  between  physical  and  logical  clusters:  physical  clus¬ 
ters  are  defined  by  the  physical  organization  of  a  system, 
mainly  by  the  nature  of  the  interconnecting  network,  and 
are  fixed;  logical  clusters  are  groups  of  processors  that  share 
memory  using  software-based  distributed  shared  memory 
(DSM)  and  can  change  dynamically.  Such  a  cluster  can  be 
a  portion  of  a  physical  cluster,  a  whole  physical  cluster,  or 
it  can  span  several  physical  clusters  incorporating  parts  or 
the whole of them (see figure 2).

An  important  aspect  of  the  environment  is  the  support 
for  threads  and  for  distributed  shared  memory.  The  elemen¬ 
tary  computing  entity  here  is  the  thread  and  an  application 
includes  a  dynamically  changing  number  of  threads.  The 
enhanced  operating  system  provides  a  single  address  space 
for the threads of an application. This address space is visible
to a number of processors within the logical cluster
structure  and  the  threads  are  mapped  on  these  processors 
without  the  knowledge  of  the  application.  Thread  migra¬ 
tion  is  possible  in  this  environment,  as  well  as  migration 
of  thread  groups,  even  whole  applications,  within  (intra¬ 
cluster)  or  between  logical  clusters. 

Figure 2. Logical clustering

The choice of software DSM as the means of forming a
logical cluster is a trade-off between raw performance and
ease of programming. The pros and cons of the message
passing and the distributed shared memory paradigms are
known and there have been numerous publications in this
area. We believe that there is a lot of room for improving
the performance of certain categories of applications on
multicomputer  clusters.  We  expect  that  careful  use  of  DSM 
in  combination  with  cost/benefit  prediction  and  dynamic 
adaptation  to  resource  demand/availability  will  demonstrate 
definite  performance  improvements  over  the  current  figures 
without  sacrificing  usability  and  programming  ease  on  the 
altar  of  performance.  In  any  case,  the  idea  of  using  a  cost 
model  to  predict  the  cost/benefit  of  certain  operating  sys¬ 
tem  operations  is  independent  of  using  one  paradigm  or  the 
other  and  can  contribute  in  all  cases  to  the  better  adminis¬ 
tration  of  a  system’s  resources. 

4  The  Applications 

The  type  of  application  of  interest  is  a  parallel  applica¬ 
tion  with  high  resource  demands,  such  as  a  large  scientific 
simulation.  This  type  of  application  usually  presents  a  dy¬ 
namic  change  of  resource  requirements,  that  is,  it  presents 
irregularity:  “data  structures,  communication  patterns,  or 
computation  are  not  defined  by  simple,  repeating  struc¬ 
tures”  [4]. 

Our  computational  model  mixes  shared-memory  pro¬ 
gramming  (done  with  threads)  with  distributed-memory 
programming  (done  with  a  message  passing  environment 
such  as  MPI).  Our  initial  work  was  to  support  Pthreads,  the 
POSIX-standard  threads  library,  but  we  are  moving  to  sup¬ 
port  the  emerging  standard  for  multiprocessing,  OpenMP, 
in  conjunction  with  MPI. 

As  an  example  of  this  application  type,  consider  a 
weather  simulation,  tracking  the  progress  of  a  storm  front 
as  it  moves  across  a  landscape  which  has  been  mapped  onto 
a  grid.  Each  grid  cell  can  be  mapped  onto  a  multithreaded 
MPI process. The grid cells containing the storm front
will require the most computation, and as the front moves,
the  computational  hotspots  will  move  across  the  grid.  To 
achieve  good  performance,  we  want  to  rebalance  the  work¬ 
load.  Historically  this  has  been  done  either  through  data 
migration  or  computational  migration. 

Static  thread  allocation  for  an  application  with  dynamic 
resource  requirements  leads  either  to  processors  sitting  idle, 
when  resource  usage  is  overestimated,  or  to  delay  from  load 
imbalance  when  usage  is  underestimated.  Clearly,  such 
a  computation  can  only  be  performed  efficiently  with  the 
combination  of  a  load  balancing  technique  with  dynamic 
granularity  control  [3].  Using  our  approach,  one  can  add 
extra  computational  power  within  the  address  space  of  a 
hotspot,  thus  spreading  the  load  to  more  threads.  In  the 
example  of  storm  front  simulation,  we  can  add  threads  to 
the  cells  containing  the  front  (these  new  threads  might  be 
put  on  local  processors,  remote  processors  via  DSM  in  an 
expanded  logical  cluster,  or  we  might  even  choose  to  mi¬ 
grate  the  entire  process  to  a  larger  SMP  and  add  threads 
there).  Once  the  front  passes,  the  extra  threads  are  no  longer 
necessary  and  can  be  killed,  freeing  the  occupied  proces¬ 
sors.  Each  multithreaded  MPI  process  can  be  assigned  to 
a  logical  cluster  with  a  certain  number  of  nodes  that  share 
memory  and  provide  the  illusion  of  a  single  address  space 
to  the  threads  of  the  process,  while  each  thread  runs  on  a 
different  CPU.  The  MPI  processes  communicate  with  cross¬ 
cluster  messages.  Processors  can  be  dynamically  added 
to/removed  from  each  logical  cluster  in  response  to  load 
variations. 

5  Our  Approach 

Our  work  is  based  on  Linux,  which  is  the  de  facto  stan¬ 
dard  for  free  software  cluster  operating  systems.  This  has 
the  obvious  advantage  of  allowing  us  to  add  code  to  the  ker¬ 
nel  as  we  deem  necessary.  We  are  building  two  sets  of  soft¬ 
ware  to  support  our  needed  functionality:  kernel  extensions 
and  user-level  libraries.  Our  work  to  date  has  focused  on 
determining  the  required  functionality,  and  the  line  between 
what  should  go  in  the  kernel  and  what  should  go  in  the  user 
library  is  not  yet  set. 

5.1  Additional  Functionality 

We  are  augmenting  Linux  with  heterogeneous  dis¬ 
tributed  shared  memory  and  multithreading  functionality. 
Functionality similar to the Mermaid prototype [8] is desired,
as well as conformance to the OpenMP standard [17]
and the popular Pthreads interface.

Figure 3. The main parameters of the cost model

The additional functionality that we are currently adding
to the system is as follows:

1. Add a thread or a group of threads to an existing shared
address space.

2. Add a processor or remove a processor from a shared
address space.

3.  Dynamically  migrate  a  thread  from  one  processor  to 
another.  The  target  processor  should  automatically  be 
added  to  the  shared  address  space,  if  not  already  in¬ 
cluded,  and  the  source  processor  should  possibly  be 
removed,  if  no  other  thread  runs  on  it.  Removing 
a  processor  from  an  address  space  might  not  always 
be  desirable  as  it  can  cause  thrashing  on  processor 
add/remove. 

4.  Form,  modify,  manage,  and  cancel  a  logical  cluster  of 
nodes.  These  functions  will  correspondingly  create, 
modify,  manage,  and  delete  the  necessary  data  struc¬ 
tures  that  need  to  be  maintained  to  support  logical  clus¬ 
ters. 

5. Evaluate cost functions to estimate the cost/benefit of
performing certain operations, for example creating or
canceling a cluster, adding/removing a processor to/from
an address space, starting/killing a thread on a specific
processor, and migrating a thread, which may result in the
inclusion of the target processor in the shared address
space and/or the removal of the source processor, as
mentioned above.
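
The membership rule of item 3 above (the target processor joins the address space automatically, and the source processor leaves only if it is left without threads) can be sketched as a toy Python model. The class and method names are hypothetical and do not correspond to an actual kernel or library interface; the flag `keep_idle_processors` is an assumed way to express that removing a processor is not always desirable.

```python
class SharedAddressSpace:
    """Toy model of the processor membership of a shared address space."""

    def __init__(self, keep_idle_processors=False):
        # Maps a processor id to the set of thread ids running on it.
        self.threads_on = {}
        # When True, a processor left without threads stays a member,
        # which avoids thrashing on repeated processor add/remove.
        self.keep_idle_processors = keep_idle_processors

    def add_processor(self, proc):
        self.threads_on.setdefault(proc, set())

    def remove_processor(self, proc):
        if self.threads_on.get(proc):
            raise ValueError("processor still runs threads")
        self.threads_on.pop(proc, None)

    def add_thread(self, thread, proc):
        # Adding a thread implicitly includes its processor.
        self.add_processor(proc)
        self.threads_on[proc].add(thread)

    def migrate_thread(self, thread, source, target):
        if thread not in self.threads_on.get(source, set()):
            raise ValueError("thread does not run on the source processor")
        if source == target:
            return
        self.add_thread(thread, target)
        self.threads_on[source].discard(thread)
        # Drop the source only when it is idle and removal is allowed.
        if not self.threads_on[source] and not self.keep_idle_processors:
            self.remove_processor(source)
```

In this sketch the decision to keep or drop an idle processor is a fixed flag; in the approach described here such a decision would instead be informed by the cost functions of item 5.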


5.2  The  Cost/Benefit  Model 

The  operating  system  extensions  are  mechanisms,  not 
policies.  Policy  decisions  will  be  made  either  in  middleware 
layers  or  by  the  application  itself,  e.g.  whether  or  not  to  add 
a  thread  to  a  running  process.  To  assist  the  higher  software 
layers  in  making  these  decisions,  we  have  developed  a  cost 
model  so  that  the  operating  system  can  accurately  predict 
the  ramifications  of  policy  decisions. 

Our  cost  model  moves  qualitatively  along  three  axes:  the 
first  axis  has  as  parameter  the  amount  of  system  information 
taken  into  account;  the  second  axis,  the  amount  of  applica¬ 
tion  information;  and  the  third,  the  speed  of  the  cost  esti¬ 
mation.  Figure  3  represents  a  qualitative  picture  of  the  cost 
model  behavior.  The  accuracy  of  the  model  is  generally 
inversely  proportional  to  the  speed  of  the  estimation.  The 
speed  of  the  model  depends  on  the  type  and  quantity  of 
application  and  system  information  that  must  be  processed. 
Accuracy  increases  and  speed  decreases  as  the  volume  of 
information  increases.  The  cost  of  a  cost  estimation  has  to 
be  low  when  compared  with  the  benefit  that  a  correct  policy 
decision  will  offer.  On  the  other  hand,  the  estimated  cost 
needs  to  be  accurate  enough  to  enable  correct  decisions. 
Clearly,  this  is  a  complex  problem  and  there  are  several  fac¬ 
tors  that  need  to  be  considered.  Therefore,  the  cost  model 
needs  to  be  parametric  and  provide  for  various  trade-offs  of 
accuracy for speed.

In  its  simplest  form,  the  cost  model  views  the  cost  of 
operations as a set of discrete cost classes. This classification
is  of  the  least  accuracy  and  maximum  speed  and  it  works 
when  a  fast  answer  is  needed  and  no  information  is  available 
from  the  application  side.  As  an  example,  in  a  system  with 
two  clusters  where  one  cluster  is  formed  by  dual-processor 
x86  boxes  and  assuming  3  cost  classes,  the  cost  of  adding 
a  thread  to  the  second  processor  of  a  node  belongs  to  class 
#0.  The  cost  of  adding  a  thread  to  a  new  (unused)  processor 
in  the  local  cluster  belongs  to  cost  class  #1.  The  cost  to 
migrate  a  thread  to  the  neighboring  cluster  is  of  class  #2, 
the  highest.  Here  the  accuracy  of  the  model  is  the  lowest  as 
it  relies  only  on  basic  system  information. 
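
The three-class example above can be written down as a minimal Python sketch. It assumes that a processor location is given as a (cluster, node) pair; this representation and the names `CostClass` and `cost_class` are illustrative only.

```python
from enum import IntEnum


class CostClass(IntEnum):
    """Discrete cost classes of the example, cheapest first."""
    SAME_NODE = 0         # other processor of the same node
    LOCAL_CLUSTER = 1     # another node in the local cluster
    NEIGHBOR_CLUSTER = 2  # a node in the neighboring cluster


def cost_class(source, target):
    """Classify an operation by where its target lies relative to the source.

    Both arguments are (cluster, node) pairs (hypothetical layout). Only
    basic system information is used, so the answer is fast but coarse.
    """
    src_cluster, src_node = source
    dst_cluster, dst_node = target
    if src_cluster != dst_cluster:
        return CostClass.NEIGHBOR_CLUSTER
    if src_node != dst_node:
        return CostClass.LOCAL_CLUSTER
    return CostClass.SAME_NODE
```

Because the classes are ordered integers, a caller can compare candidate placements directly, for instance by picking the target with the smallest class.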

In  its  richest  form,  the  model  can  provide  a  set  of  cost 
information  about  an  operation  and  the  impact  this  opera¬ 
tion  will  have  on  the  performance  of  the  application.  For 
this  purpose,  it  is  necessary  to  combine  information  from 
both  the  system  and  the  application.  The  system  maintains 
a  vector  of  state  and  cost  information  by  directly  access¬ 
ing  information  maintained  by  the  operating  system  kernels 
and  by  periodically  running  a  set  of  suitable  benchmarks 
to  measure  other  important  cost-affecting  quantities,  like 
communication  latencies  or  node  loads  (cf.  the  Network 
Weather  Service  [26]).  The  application  passes  in  a  descrip¬ 
tion  of  its  behavior,  e.g.  memory  access  patterns,  thread 
running  patterns,  acceptable  delays,  etc.  Statistical  informa¬ 
tion,  gathered  during  previous  runs  of  the  application,  can 
help  identify  its  behavior  when  this  behavior  is  not  known 
in  advance.  The  cost  model  combines  the  supplied  informa¬ 
tion  to  produce  a  cost  rating,  a  quantity  that  can  be  used  in 
comparing  the  cost  of  different  system  operations. 


Figure  4.  Three  possible  configurations 

For example, consider two neighboring (physical) clusters,
A and B, with 10 processors each, and an application
running on six processors of cluster A, which form a logical
cluster, with six threads, one per processor (see figure 4).
Suppose  that  at  a  certain  phase  during  execution  the  appli¬ 
cation  wants  to  spawn  another  group  of  six  threads.  Would 
it  be  better  to  use  six  processors  of  cluster  A  (4  free  ones 
and  two  already  running  a  thread  each),  use  the  remaining  4 
processors  of  cluster  A  and  two  more  from  cluster  B,  or  use 
six  processors  of  cluster  B  instead?  Cross-cluster  communi¬ 
cation  is  more  expensive  than  intra-cluster  communication 
so  the  first  option  seems  more  attractive  than  the  second  and 
the  second  more  attractive  than  the  third.  However,  this  is 
not  necessarily  the  best  order.  The  first  option  suffers  from 
the  fact  that  two  processors  have  to  run  two  threads  each. 
Also,  what  is  of  highest  benefit  depends  on  how  the  appli¬ 
cation  behaves.  If  the  new  group  of  threads  works  in  close 
cooperation  with  the  initial  thread  group,  then  the  first  op¬ 
tion  is  probably  better.  If  the  group  of  these  six  new  threads 
performs  a  separate  task  and  communicates  infrequently 
with  the  initial  thread  group,  that  is,  data  shared  between 
the  two  groups  are  infrequently  accessed,  then  the  third 
option  is  probably  better.  This  is  because  the  infrequent 
communication  between  the  two  thread  groups  keeps  the 
communication  cost  low.  In  this  case,  if  the  second  option 
were  chosen,  the  frequent  communication  between  the  two 
threads  running  on  cluster  B  with  the  other  four  on  cluster  A 
would  impose  a  heavy  increase  in  the  communication  cost. 

The  following  is  a  cost  rating  estimation  for  the  three 
thread  placement  options  considered.  Since  a  highly  de¬ 
tailed  estimation  would  be  too  long  to  present  here,  certain 
simplifying  assumptions  are  necessary.  The  first  assump¬ 
tion  is  that  the  system  costs  are  identical  for  all  processors. 
A  more  detailed  estimation  would  include  the  cost  of  per¬ 
forming  certain  operations  on  each  individual  node,  or  type 
of  node.  A  second  assumption  is  that  the  costs  for  sending 
messages are estimated using the basic LogP model [16].
Consider  now  the  following  costs: 

$s$  cost to start a thread on a processor,

$e$  cost to include a processor in a shared address space,

$e_r$  cost to include a processor that belongs to a neighboring
cluster in a shared address space,

$c_i$  cost for sending a message across the intra-cluster network,

$c_c$  cost for sending a message across the cross-cluster network,
$c_c > c_i$.

Let the application supply information about itself in the
form of two weight factors, $w_g$ and $w_r$. $w_g$ expresses the
percentage of accesses to variables shared within the new
group of threads, and $w_r$ expresses the percentage of accesses
to variables shared between the old and the new thread
groups, so that $w_g + w_r = 100\%$. The application declares that the
threads of the new group communicate frequently with each
other by having $w_g \gg w_r$, that is, intra-group communication
is much more frequent than communication between groups. We would like
to compare the three basic thread group placement options,
in the present case all six threads on cluster A (option 6/0),
four threads on cluster A and two on B (option 4/2), or all
six on cluster B (option 0/6).

The first thing to estimate is the overhead for spawning
the new thread group: for the 6/0 option $6(s + \frac{2}{3}e)$, for the
4/2 option $6(s + \frac{2}{3}e + \frac{1}{3}e_r)$, and for the 0/6 option $6(s + e_r)$.
Since $e_r > e$, the 0/6 option has more overhead than the 4/2
option, and that in turn more than the 6/0 option, which is
expected.
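
The ordering of the three overheads can be checked by expanding them. The expansion below reads each overhead, as an interpretation of the setup above, as six thread starts plus one inclusion cost for every processor newly added to the shared address space (four local processors in the 6/0 case, four local and two remote in the 4/2 case, six remote in the 0/6 case). The labels $O_{6/0}$, $O_{4/2}$, and $O_{0/6}$ are introduced here for convenience only.

```latex
\begin{align*}
O_{6/0} &= 6s + 4e = 6\left(s + \tfrac{2}{3}e\right),\\
O_{4/2} &= 6s + 4e + 2e_r = 6\left(s + \tfrac{2}{3}e + \tfrac{1}{3}e_r\right),\\
O_{0/6} &= 6s + 6e_r = 6\left(s + e_r\right),\\
O_{4/2} - O_{6/0} &= 2e_r > 0,\\
O_{0/6} - O_{4/2} &= 4\,(e_r - e) > 0 \quad \text{since } e_r > e.
\end{align*}
```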

This  overhead  is  not  enough  to  make  a  decision.  Here 
the  cost  rating  has  to  have  at  least  two  parts,  a  fixed  part,  the 
overhead,  and  a  variable  part,  a  rating  for  the  communica¬ 
tion  and  the  running  cost. 

For estimating the communication cost rating, assume
that $m$ is the number of messages associated with the access
of shared data by threads of the new group per time unit
(message rate). Then the 6/0 option has a rating of
$m(w_g + w_r)c_i = m c_i$. For the 4/2 option assume a split factor $k$
to distinguish between the messages sent within A and B
and those that cross the boundary between them. That is,
a fraction $k$ of the messages is intra-cluster within clusters A and B,
and a fraction $1 - k$ is cross-cluster between A and B. Then the cost
rating is $m(k c_i + (1 - k) c_c)$. Finally, for the 0/6 option the
rating is $m(w_g c_i + w_r c_c)$. To compare the rating of 0/6 to
that of 4/2 we subtract the first from the second. After the
calculations we obtain $m(c_c - c_i)(w_g - k)$. Because $c_c > c_i$
and $w_g \gg w_r$, this quantity is positive when $w_g > k$.
Having $w_g < k$ is incompatible with the assumption that
the threads of the new group communicate frequently with
each other and infrequently with the old group; the value
of $w_g$ is close to 100% and a value for $k$ approaching $w_g$
would mean that the communication between the subgroup
of the two threads and that of the four threads is infrequent,
in direct contrast with the above assumption. Thus, the cost
rating for the 4/2 option is higher than that of the 0/6 option.
In turn, the cost rating of 0/6 is higher, but only slightly,
than that of 6/0. So far, 6/0 has the lowest overhead and cheapest
communications, with 0/6 second and 4/2 last.
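
For reference, the subtraction mentioned above can be written out step by step. The weights are treated as fractions with $w_g + w_r = 1$, and the labels $R_{6/0}$, $R_{4/2}$, and $R_{0/6}$ for the three communication cost ratings are introduced here only for this derivation.

```latex
\begin{align*}
R_{4/2} - R_{0/6}
  &= m\bigl(k c_i + (1-k)c_c\bigr) - m\bigl(w_g c_i + w_r c_c\bigr)\\
  &= m\bigl(k c_i + (1-k)c_c - w_g c_i - (1-w_g)c_c\bigr)
     && \text{using } w_r = 1 - w_g\\
  &= m\bigl((w_g - k)c_c - (w_g - k)c_i\bigr)\\
  &= m\,(c_c - c_i)(w_g - k),\\[4pt]
R_{0/6} - R_{6/0}
  &= m\bigl(w_g c_i + w_r c_c\bigr) - m c_i
   = m\,w_r\,(c_c - c_i).
\end{align*}
```

The second difference is positive but small when $w_r$ is small, which is why 0/6 is rated only slightly above 6/0.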

To make the final decision, it is necessary to estimate the
cost of running for the three options. Assume a time period
$\Delta t$. During this period the new thread group causes the
sending of $m \Delta t$ messages. Also assume that during $\Delta t$ the
maximum number of thread instructions available for execution
is $I_t$ and that a processor can execute $x$ instructions
per time unit.

Figure 5. Centurion, ethernet connections

For cases 4/2 and 0/6, each thread will take $I_t/x$ time.
For the 6/0 case, two of the threads are running essentially
at half speed $x/2$ since they run on already occupied processors.
Assuming that in general we have CPU-bound threads,
these two threads will take double the time of the other four
threads, probably delaying their work (the interference
pattern of the threads is another factor to take into account
for more accurate estimations). It can be concluded now
that option 0/6 is probably the best choice, because it has
almost the same communication cost rating as 6/0 but half
the running cost rating, and in the long run the overhead can
be ignored.
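
The three parts of the estimate (spawn overhead, communication cost rating, and running time of the slowest thread) can be put together in a short Python sketch. This is an illustration under the simplifying assumptions of the example, not the implemented cost model; the function name, the argument names, and the choice to return the three parts separately rather than combine them into one number are all assumptions made here.

```python
def option_ratings(s, e, e_r, c_i, c_c, w_g, k, m, instr, x):
    """Return {option: (overhead, comm_rating, run_time)} for 6/0, 4/2, 0/6.

    s, e, e_r  -- thread start cost, local and remote inclusion costs
    c_i, c_c   -- intra-cluster and cross-cluster message costs
    w_g        -- fraction of accesses within the new group (w_r = 1 - w_g)
    k          -- fraction of intra-cluster messages in the 4/2 split
    m          -- message rate of the new thread group
    instr, x   -- instructions per thread and processor speed
    """
    w_r = 1.0 - w_g
    # Fixed part: six thread starts plus newly included processors.
    overhead = {
        "6/0": 6 * s + 4 * e,
        "4/2": 6 * s + 4 * e + 2 * e_r,
        "0/6": 6 * s + 6 * e_r,
    }
    # Variable part: communication cost per time unit.
    comm = {
        "6/0": m * c_i,
        "4/2": m * (k * c_i + (1 - k) * c_c),
        "0/6": m * (w_g * c_i + w_r * c_c),
    }
    # Running time of the slowest thread; in 6/0 two threads share
    # an already occupied processor and run at half speed.
    run = {
        "6/0": instr / (x / 2),
        "4/2": instr / x,
        "0/6": instr / x,
    }
    return {opt: (overhead[opt], comm[opt], run[opt]) for opt in overhead}
```

A caller would pass measured or estimated values for the costs and the application-supplied weights, then compare the returned tuples; how the three parts are weighted against each other is a policy decision left to the higher software layers.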

There  are  other  factors  that  can  affect  the  cost  of  each 
option.  For  example,  if  clusters  A  and  B  contain  processors 
of  different  architectures,  cross-cluster  communication  may 
be  much  more  expensive  than  before,  due  to  the  necessary 
data  conversions.  This  cost  can  be  counterbalanced  if  the 
processors  of  cluster  B  are  faster  than  the  ones  of  cluster  A, 
thus  providing  for  higher  execution  rates.  Finally,  one  has 
to  take  into  account  the  effect  of  the  already  existing  load  of 
the  processors  and  the  network  as  it  changes  with  time,  as 
there  can  be  other  applications  or  tasks  sharing  the  system 
resources. 

5.3  The  Experimental  Computing  Environment 

The  hardware  we  are  using  for  experimentation  is  a 
metacluster  comprising  the  Centurion  machine  [2]  and  the 
Syracuse  Orange  Grove  cluster.  Centurion  has  256  nodes, 
with  128  533MHz  DEC  Alpha  nodes  and  128  400MHz  dual 
Pentium II nodes. Each node is connected to a 100Mbps fast
ethernet  switch  and  multiple  such  switches  are  joined  via 
gigabit  ethernet  switches  (see  figure  5).  The  initial  64  Al¬ 
pha  nodes  are  also  joined  in  a  complex  mesh  by  a  1.2Gbps 
Myrinet  fabric  (see  figure  6).  The  Orange  Grove  cluster 
has  16  533MHz  DEC  Alpha  nodes  and  48  450MHz  dual 
Pentium  III  nodes,  and  has  a  system  area  network  similar  to 
Centurion’s.  Figure  7  presents  a  simplified  overall  view  of 
the total system. The only real restriction is that each CPU
needs to run the modified Linux system.

Figure 6. Myrinet connected part of Centurion

Figure 7. The experimental system


6 Related Work

Much research has been and is being dedicated
to issues related to clusters of commodity workstations. Our
approach is novel, to the best of our knowledge, in the way
it supports clustering, in providing the mechanisms for dynamic
resource sharing in this framework, and most of all
in using cost models to estimate the cost/benefit of certain
operating system operations and aid in decision making. The current
section presents work related to our research with respect
to operating systems, distributed shared memory, and other
relevant areas.


6.1  Related  Work  in  Operating  Systems 

The  first  work  of  interest  is  a  research  project  of  the 
Computer  Systems  Research  Institute  of  the  University 
of  Toronto,  Canada,  targeted  to  hierarchically  structured 
NUMA clusters [14]. Part of this project was the design and
construction  of  the  Hector  hierarchically  structured  shared 
memory  multiprocessor.  Hurricane  is  the  operating  system 
that  was  specially  developed  for  Hector.  The  structure  of 
this  operating  system  reflects  the  structure  of  the  multipro¬ 
cessor.  Hurricane  uses  tight  coupling  within  a  local  cluster 
and loose coupling between clusters. Operating system services
are designed to take advantage of high speed connections
and locality of data. Cluster size is determined statically.
Processes of an application are scheduled on the same
cluster,  unless  there  is  a  benefit  for  a  job  to  span  multiple 
clusters.  Within  a  cluster  load  is  balanced  at  a  fine  granular¬ 
ity  and  cross-cluster  scheduling  performs  coarse-grain  load 
balancing  by  assigning  and  migrating  processes  to  specific 
clusters.  Tests  of  the  system  have  shown  that  applications 
need  to  be  adapted  to  Hector/Hurricane  to  perform  well. 

The  computing  environment  targeted  in  our  research  is 
substantially  different  from  the  Hector/Hurricane  environ¬ 
ment.  The  Hector/Hurricane  approach  was  an  attempt  to 
create  a  cost-effective,  scalable  NUMA  machine,  utilizing 
specially  designed  hardware.  Experimenting  with  hardware 
is not among our goals. Although the targeted environment
maintains  the  notion  of  tightly-coupled  sub-clusters  within 
a  larger  hierarchical  system,  there  is  no  hardware  support 
for  shared  memory,  except  for  the  case  of  a  multiprocessor 
PC  box.  Also,  architecture  homogeneity  is  not  necessary, 
the  cluster  hierarchy  is  not  of  fixed  depth,  and  the  (logi¬ 
cal)  cluster  size  is  not  determined  statically.  However,  there 
are  several  lessons  from  the  Hector/Hurricane  approach  that 
can  be  useful  to  us,  e.g.  that  the  factors  found  to  be  crucial 
for  application  performance  depend  on  application  behavior 
and  system  behavior.  This  is  exactly  what  our  work  inves¬ 
tigates.  The  fact  that  applications  needed  to  be  specifically 
adapted  to  run  efficiently  on  Hector  under  Hurricane  under¬ 
lines  the  importance  of  having  operating  systems  that  can 
provide  the  necessary  information  and  allow  applications  to 
self-adapt  to  the  dynamically  changing  resource  availability 
of  a  system. 

A more recent work is the Berkeley Network of Workstations
(NOW) project [18]. This project is targeted at workstation
clusters and includes operating-system-level software
such as the GLUnix layer [19]. GLUnix is a multi-user,
user-level system for a cluster of workstations. It is
designed  to  provide  transparent  remote  execution  by  main¬ 
taining  a  single-system  image  across  an  entire  cluster  and 
supports interactive parallel and sequential jobs and load
balancing. The proposed research differs in that it targets
not a single cluster but hierarchically clustered
systems  with  several  cluster  levels.  Although  it  is  possi¬ 
ble  to  provide  a  single-system  image  across  clusters  and 
levels  of  hierarchy,  the  approach  taken  here  is  to  provide 
the  users  with  the  necessary  data  to  make  informed  choices 
about  state  sharing  issues.  In  any  case,  increasing  the  de¬ 
gree  of  sharing  is  usually  accompanied  by  a  decrease  in 
performance, and users may be willing to sacrifice transparency
for performance. GLUnix supports load balancing
through intelligent job placement but supports neither task
migration nor any form of dynamic resource sharing based
on cost/benefit prediction.

6.2  Related  Work  in  Distributed  Shared  Memory 

There is a large body of research on distributed
shared memory (DSM). DSM systems can be classified
according to:

•  the  level  of  implementation  (user,  kernel,  hybrid), 

•  the  coherence  protocols  (SRSW,  MRSW,  MRMW, 
etc.), 

•  the  consistency  model  (sequential,  processor,  release, 
etc.). 

Three  general  techniques  have  been  used  [20].  The 
first  technique  simulates  a  multiprocessor  by  using  a  mod¬ 
ified  pagefault  trap  to  do  paging  over  the  network,  e.g. 
IVY  [23].  The  second  technique  focuses  on  shared  vari¬ 
ables  rather  than  pages.  The  programmer  is  required  to  an¬ 
notate  the  shared  variables  with  their  anticipated  access  pat¬ 
terns.  These  annotations  are  then  used  by  the  DSM  runtime 
system  for  selecting  the  most  suitable  coherence  protocol. 
This  technique  is  used  in  Munin  [5].  The  third  technique 
is  object-oriented  and  requires  a  high-level  programming 
model,  e.g.  Linda  [21],  Orca  [22]. 

Relatively  few  approaches  have  dealt  with  the  issue  of 
heterogeneity.  Mermaid  [8]  is  an  implementation  of  hetero¬ 
geneous  DSM  as  an  extension  to  the  IVY  system.  Accord¬ 
ing  to  Mermaid’s  creators,  heterogeneous  DSM  is  feasi¬ 
ble  and  presents  comparable  performance  to  homogeneous 
DSM.  However,  many  issues  need  to  be  addressed  due  to 
heterogeneity. 

Providing  yet  another  DSM  system  is  not  in  the  scope 
of  our  research.  Our  main  interest  is  in  developing  a  DSM 
system  that  can  provide  cluster-oriented  functionality  and 
cost  data  on  typical  DSM  operations.  The  preferred  way  to 
follow  here  is  to  adopt  an  existing  system  and  subsequently 
modify and augment it to serve the intended purposes. Several
decisions need to be made, but the main direction is
to adopt the simplest approach possible, decide the level of
implementation, and add the minimum functionality necessary.
In this sense, the IVY and Mermaid systems seem
to  be  attractive.  The  issue  of  fully  supporting  heterogene¬ 
ity  remains  open,  because  forming  clusters  with  machines 
of  heterogeneous  architectures  or  sharing  memory  among 
clusters  containing  homogeneous  hardware  but  of  different 
architecture  from  cluster  to  cluster  are  options  of  high  com¬ 
plexity  and  questionable  performance.  The  Munin  approach 
is  also  attractive  because  of  the  focus  on  the  different  access 
patterns  of  shared  variables.  Because  shared  variable  ac¬ 
cess  patterns  constitute  an  important  parameter  of  the  cost 
model,  it  would  be  interesting  to  see  the  impact  the  utiliza¬ 
tion  of  such  a  DSM  approach  has  on  performance.  How¬ 
ever,  the  use  of  a  Munin-like  system  requires  additional  pro¬ 
grammer  effort,  the  annotations  of  shared  variables,  which 
is  not  particularly  desirable  here  if  it  means  many  modifica¬ 
tions  to  already  existing  programs. 

Several  other  works  on  DSM  need  to  be  mentioned  here, 
because  they  contain  useful  elements  for  our  work.  Iftode 
et  al.  [6]  study  the  sharing  patterns  of  applications  running 
on  DSM  systems  and  identify  several  factors  affecting  per¬ 
formance.  Lu  et  al.  [7]  add  limited  compiler  support  and 
modifications  to  TreadMarks  [24]  to  eliminate  unnecessary 
computation  and  communication  during  the  execution  of 
irregular  applications.  Yoon  and  Malek  [9]  propose  the  cre¬ 
ation  of  a  single  address  space  per  running  program  instead 
of  a  global  address  space  shared  by  all  processing  nodes. 
This is similar to the definition of logical clusters, with the
difference that more than one program can run on a logical
cluster, regardless of the effect this has on performance.
Kim  and  Vaidya  [10]  propose  an  adaptive  DSM  system 
where  statistics  about  memory  accesses,  collected  over  a 
sampling  period,  are  used  for  determining  the  protocol  that 
has  the  minimum  cost  for  each  memory  page.  Erlichson 
et  al.  [11]  present  a  kernel-level  implementation  of  DSM 
on  an  8-node,  8-processors/node  cluster  and  study  the  cost 
of  DSM  primitives  and  the  effects  of  clustering  on  perfor¬ 
mance.  Finally,  Quarks  [12]  is  a  relatively  new  simple  DSM 
approach.  Quarks  is  a  descendant  of  Munin  and  is  aimed  at 
reducing  the  communication  overhead.  Its  basic  abstraction 
is  that  of  shared  regions,  which  are  page-aligned  byte  ranges 
of  variable  length.  For  this  DSM  system  there  exists  a  freely 
available  Linux  port. 

6.3  Other  Relevant  Works 

In  terms  of  functionality,  the  most  closely  related  work 
to  our  approach  is  the  Scalable  Concurrent  Programming 
Library  (SCPlib)  [4]  of  the  Scalable  Concurrent  Program¬ 
ming  Laboratory  at  Syracuse  University.  SCPlib  is  the  de¬ 
scendant  of  the  concurrent  graph  library  [25]  and  is  aimed  at 
supporting  irregular  applications  on  parallel  hardware.  The 
library provides heterogeneous communication and file I/O,
load  balancing,  and  dynamic  task  granularity  control.  The 
user  needs  to  supply  a  set  of  special  support  routines  for 
the  library  to  successfully  perform  migration  of  tasks  and 
granularity  adjustments.  The  load  balancing  mechanism  is 
based  on  the  concept  of  heat  diffusion  [3].  Communication 
cost  is  taken  into  account  when  deciding  task  movements. 

Our  work  is  targeted  at  a  more  general  and  complex  en¬ 
vironment  than  SCPlib.  The  goal  is  to  provide  function¬ 
ality  for  a  hierarchy  of  clusters  and  help  applications  and 
middleware  make  decisions  by  estimating  as  accurately  as 
possible,  or  to  the  desired  degree  of  accuracy,  the  cost  of 
certain  operating  system  services  especially  when  dynam¬ 
ically  sharing  resources.  In  view  of  this,  the  proposed  re¬ 
search  can  address  all  issues  covered  by  SCPlib.  Although 
load  balancing  is  not  in  the  scope  of  our  research,  the  use 
of  the  cost  model  can  greatly  facilitate  a  sophisticated  load 
balancing  mechanism  by  providing  the  cost  of  various  task 
placement  options  when  deciding  how  to  redistribute  the 
load  and  not  only  the  cost  of  communications.  Also,  SCPlib 
is  based  on  message-passing  whereas  we  support  DSM  and 
mixed  mode  computations. 

Another  research  work,  which  has  common  ground  with 
the  proposed  research,  is  that  by  Kravets  et  al.  [15].  This 
work  proposes  a  cooperative  solution  to  the  dynamic  man¬ 
agement  of  communication  resources.  In  this  solution,  ap¬ 
plication  requirements,  expressed  in  the  form  of  “payoff” 
functions,  and  network  resource  availability,  expressed  in 
the  form  of  service  availability  curves,  are  taken  into  ac¬ 
count  by  a  configurable  communication  layer.  The  goal  is  to 
better exploit communication resources by allowing applications
and networks to adapt to each other. The ultimate goal of our
approach is to allow applications to self-adapt to
the changing availability of all system resources; it does not
focus specifically on the communication layer.

7  Summary 

Modern clusters connected by high-speed networks are
capable  of  outperforming  supercomputers,  but  the  perfor¬ 
mance  delivered  to  scientific  applications  is  only  a  frac¬ 
tion  of  this  maximum.  Identifying  software  overhead  as 
a  key  reason  for  this  discrepancy,  we  have  described  our 
approach to solving this problem. We focus on a hardware
environment  with  hierarchically  organized  clusters  of  com¬ 
puting  nodes.  In  this  environment  we  form  logical  groups 
of  nodes  by  using  software  DSM  and  multithreading,  and 
augment  the  capabilities  of  the  operating  system  with  dy¬ 
namic  resource  sharing  primitives  and  a  decision-making 
framework based on a parametric cost/benefit model. The
model  works  for  a  variety  of  situations  with  data  availability 
ranging from minimal to high. Our goal is to predict the performance
ramifications of policy choices and thus to allow
applications and middleware to adapt to their computing
environment and achieve better overall performance.
Abstract

Metacomputing frameworks have received renewed
attention of late, fueled both by advances in hardware and
networking, and by novel concepts such as computational
grids. HARNESS is an experimental metacomputing
system based upon the principle of dynamic
reconfigurability, not only in terms of the computers and
networks that comprise the virtual machine, but also in the
capabilities of the VM itself. These characteristics may be
modified under user control via a “plug-in” mechanism
that is the central feature of the system. The system's
capabilities have been used to develop a PVM
compatibility suite, i.e. a set of plug-ins that allow users to
run PVM applications on top of HARNESS. In this paper
we describe the PVM-Proxy plug-in: a plug-in capable of
gluing PVM applications to distributed object
environments.


1  Introduction 

HARNESS  [1]  is  a  metacomputing  framework  that  is 
based  upon  several  experimental  concepts,  including 
dynamic  reconfigurability  and  fluid,  extensible,  virtual 
machines.  The  underlying  motivation  behind  HARNESS 
is  to  develop  a  metacomputing  platform  for  the  next 
generation,  incorporating  the  inherent  capability  to 
integrate  new  technologies  as  they  evolve.  The  first 
motivation  is  an  outcome  of  the  perceived  need  in 
metacomputing  systems  to  provide  more  functionality, 
flexibility,  and  performance,  while  the  second  is  based 
upon  a  desire  to  allow  the  framework  to  respond  rapidly  to 
advances  in  hardware,  networks,  system  software,  and 
applications.  Both  motivations  are,  in  some  part,  derived 
from  our  experiences  with  the  PVM  [2]  system,  whose 
monolithic  design  implies  that  substantial  re-engineering 
is  required  to  extend  its  capabilities  or  to  adapt  it  to  new 
network  or  machine  architectures. 

HARNESS attempts to overcome the limited flexibility
of traditional software systems by defining a simple but
powerful  architectural  model  based  on  the  concept  of  a 
software  backplane.  The  HARNESS  model  is  one  that 
consists  primarily  of  a  kernel  that  is  configured,  according 
to  user  or  application  requirements,  by  attaching  “plug-in” 
modules that provide various services. Some plug-ins are
provided as part of the HARNESS system, others
might be developed by individual users for special
situations, and yet others might be obtained from
third-party repositories. By configuring a HARNESS
virtual machine using a suite of plug-ins appropriate to the
particular hardware platform being used, the application
being executed, and resource and time constraints, users
are able to obtain functionality and performance that are
well suited to their specific circumstances. Furthermore,
since  the  HARNESS  architecture  is  modular,  plug-ins  may 
be  developed  incrementally  for  emerging  technologies 
such  as  faster  networks  or  switches,  new  data  compression 
algorithms  or  visualization  methods,  or  resource  allocation 
schemes  -  and  these  may  be  incorporated  into  the 
HARNESS system without requiring a major re-engineering
effort.

HARNESS’ reconfiguration capabilities allowed us to
design and implement a PVM compatibility suite, i.e. a set
of plug-ins that emulate the services provided by PVM
daemons and allow users to run unchanged PVM
applications on top of HARNESS. The native distributed
object programming model of HARNESS, on which the
compatibility suite is based, has been leveraged to
introduce a high level of modularity in the design of the
compatibility suite. This modularity allows new
technologies and new services to be introduced into PVM without
requiring a complete redesign of the daemon itself.

However, the compatibility suite only allows execution
of traditional message-passing PVM applications that
cannot directly take advantage of the underlying
distributed object infrastructure, whereas our goal
was to provide a seamless connection for PVM users to
distributed object technology.
Figure 1. Abstract model of a HARNESS Distributed Virtual Machine


In this paper we describe the PVM-Proxy plug-in, a
HARNESS plug-in that is able to bridge the gap between
traditional PVM applications and the native HARNESS
distributed object infrastructure. This plug-in appears as a
standard PVM task to any running PVM application, but is
able to interface directly to distributed object applications,
translating messages coming from the PVM side into RMI
or CORBA procedure calls and procedure calls coming
from the distributed object applications back into PVM
messages. This plug-in can be used to make complete
distributed object applications such as our reusable
simulation framework [3] appear as PVM tasks to PVM
applications, thereby allowing traditional PVM
applications to take full advantage of the capabilities
of the HARNESS system.

The  full  paper  will  be  structured  as  follows:  in  section  2 
we  will  give  a  detailed  description  of  the  HARNESS 
metacomputing framework and of its main architectural
features; in section 3 we will describe the HARNESS
PVM  compatibility  suite;  in  section  4  we  will  describe  the 
architectural  and  implementation  details  of  the  HARNESS 
PVM-Proxy  plug-in;  in  section  5  we  will  show  how  our 
system  can  be  used  to  glue  together  heterogeneous 
applications  using  a  distributed  MPEG  coder  as  an 
example  application;  finally,  in  section  6  we  will  provide 
some  concluding  remarks. 

2  HARNESS  System  Architecture 

The  fundamental  abstraction  in  the  HARNESS 
metacomputing  framework  is  the  Distributed  Virtual 
Machine  (DVM)  (see  figure  1,  level  1).  Any  DVM  is 
associated  with  a  symbolic  name  that  is  unique  in  the 
HARNESS  name  space,  but  has  no  physical  entities 
connected  to  it.  Heterogeneous  Computational 
Resources  may  enroll  into  a  DVM  (see  figure  1,  level  2) 
at any time; however, at this level the DVM is not yet
ready to accept requests from users. To get ready to interact
with  users  and  applications  the  heterogeneous 
computational  resources  enrolled  in  a  DVM  need  to  load 
plug-ins  (see  figure  1,  level  3).  A  plug-in  is  a  software 
component  implementing  a  specific  service.  By  loading 
plug-ins  a  DVM  can  build  a  consistent  service  baseline 
(see  figure  1,  level  4).  Users  may  reconfigure  the  DVM  at 
any  time  (see  figure  1,  level  4)  both  in  terms  of 
computational  resources  enrolled  by  having  them  join  or 
leave  the  DVM  and  in  terms  of  services  available  by 
loading  and  unloading  plug-ins. 

The  main  goal  of  the  HARNESS  metacomputing 
framework  is  to  achieve  the  capability  to  enroll 
heterogeneous  computational  resources  into  a  DVM  and 
make  them  capable  of  delivering  a  consistent  service 
baseline to users. This goal requires the programs that make
up the framework to be as portable as possible across as
wide a selection of systems as possible. The availability of
services  to  heterogeneous  computational  resources  derives 
from  two  different  properties  of  the  framework:  the 
portability  of  plug-ins  and  the  presence  of  multiple 
searchable plug-in repositories. HARNESS implements
these properties mainly by leveraging two different features
of Java technology. These features are the capability to
layer  a  homogeneous  architecture  such  as  the  Java  Virtual 
Machine  (JVM)  [4]  over  a  large  set  of  heterogeneous 
computational  resources,  and  the  capability  to  customize 
the  mechanism  adopted  to  load  and  link  new  objects  and 
libraries.  However,  the  adoption  of  the  Java  language  as 
the  development  platform  for  the  HARNESS 
metacomputing  framework  has  given  us  several  other 
advantages: 

•  it  allowed  us  to  develop  the  framework  as  a  collection 
of  cooperating  objects  with  consistent  boundaries 
(Java Classes) and to guarantee to users an OO
development  environment; 

• it allowed us to define a clear and consistent boundary
for plug-ins: each plug-in is required to appear
to the system as a Java class;

•  it  allowed  us  to  implement  all  the  entities  in  the 
framework  adopting  a  robust  multithreaded 
architecture; 

• it allows users to develop additional services both in a
passive, library-like flavor and in an active, thread-enabled
flavor;

• it provided us with an object-oriented mechanism to
request services from remote computational resources
(Java Remote Method Invocation [5]);

• it provided us with a generic methodology to transfer data
over the network in a consistent format (Java Object
Serialization [6]);

• it allowed us to provide to users the definition of
interfaces to be implemented by plug-ins
implementing the basic services;


•  it  allowed  us  to  tune  the  trade-off  between  portability 
and  efficiency  for  the  different  components  of  the 
framework. 

This last capability is extremely important. Although
portability at large is needed in all the
components of the framework, it is possible to distinguish
three different categories of components that
require different levels of portability. The first category is
represented  by  the  components  implementing  the 
capability  to  manage  the  DVM  status  and  load  and  unload 
services.  We  call  these  components  kernel  level  services. 
These services require the highest achievable degree of
portability, because they are necessary to enroll
a computational resource into a DVM. The second
category  is  represented  by  very  commonly  used  services 
(e.g.  a  general,  network  independent,  message  passing 
service  or  a  generic  event  notification  mechanism).  We 
call  these  services  basic  services.  Basic  services  should  be 
generally  available,  but  it  is  conceivable  for  some 
computational  resources  based  on  specialized  architecture 
to  lack  them.  The  last  category  is  represented  by  highly 
architecture  specific  services.  These  services  include  all 
those  services  that  are  inherently  dependent  on  the  specific 
characteristics of a computational resource (e.g. a low-level
image processing service exploiting a SIMD co-processor,
a message passing service exploiting a specific
network interface, or any service that needs architecture-dependent
optimization). We call these services
specialized  services.  For  this  last  category  portability  is  a 
goal  to  strive  for,  but  it  is  acceptable  that  they  will  be 
available  only  on  small  subsets  of  the  available 
computational  resources.  These  different  degrees  of 
required  portability  and  efficiency  over  heterogeneous 
computational  resources  can  optimally  leverage  the 
capability  to  link  together  Java  byte  code  and  system 
dependent  native  code  enabled  by  the  Java  Native 
Interface (JNI) [7]. The JNI allows the parts of
the framework that are most critical to efficient application
execution to be developed in ANSI C and to incorporate
the desired level of architecture-dependent optimization at
the cost of increased development effort [8] [9].

The use of native code requires a different
implementation of a service for each type of
heterogeneous computational resource that needs to deliver
that service. This multiplies the development effort
for each plug-in that includes native code.
However, if a version of the plug-in for a specific
architecture is available, the HARNESS metacomputing
framework is able to fetch and load it in a user-transparent
fashion, so users are shielded from the need to
keep track of the set of architectures their application is
currently running on. To achieve this result HARNESS
leverages the capability of the JVM to let users redefine
the mechanism used to retrieve and load both Java class
bytecode and native shared libraries. In fact, each DVM in
the framework is able to search a set of plug-in
repositories for the desired library. This set of repositories
is dynamically reconfigurable at run-time: users can add
new repositories at any time.
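
The loading mechanism described above can be illustrated with a minimal sketch built on the standard java.net.URLClassLoader. This is not the actual HARNESS loader: the class name PluginLoader, its methods addRepository and loadPlugin, and the nativeCache directory are all assumptions introduced for illustration. The sketch only shows how a repository list can grow at run-time and how the lookup of native shared libraries can be redirected.

```java
import java.io.File;
import java.net.URL;
import java.net.URLClassLoader;

/** Hypothetical plug-in loader; a sketch, not the actual HARNESS class. */
final class PluginLoader extends URLClassLoader {
    // Assumed local directory where fetched native libraries are stored.
    private final File nativeCache;

    PluginLoader(ClassLoader parent, File nativeCache) {
        super(new URL[0], parent); // start with an empty repository list
        this.nativeCache = nativeCache;
    }

    /** Adds a repository (directory or JAR URL) while the loader is in use. */
    void addRepository(URL repository) {
        addURL(repository); // protected in URLClassLoader, exposed here
    }

    /** Loads a plug-in class by name and creates an instance of it. */
    Object loadPlugin(String className) throws ReflectiveOperationException {
        Class<?> pluginClass = loadClass(className);
        return pluginClass.getDeclaredConstructor().newInstance();
    }

    /** Redirects native library lookup to the assumed local cache. */
    @Override
    protected String findLibrary(String libName) {
        File candidate = new File(nativeCache, System.mapLibraryName(libName));
        // Returning null lets the JVM fall back to java.library.path.
        return candidate.isFile() ? candidate.getAbsolutePath() : null;
    }
}
```

A caller would create one loader, call addRepository for each known repository URL, and then call loadPlugin with the fully qualified class name of the desired plug-in.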

The  kernel  level  services  of  a  Harness  DVM  are 
delivered  by  a  distributed  system  composed  of  two 
categories  of  entities: 

•  a  DVM  status  server,  unique  for  each  DVM; 

•  a  set  of  Harness  kernels,  one  and  only  one  running  on 
each  computational  resource  currently  enrolled  or 
willing  to  be  enrolled  into  a  DVM. 

To  achieve  the  highest  possible  degree  of  portability  for 
the  kernel  level  services  both  the  kernel  and  the  DVM 
status  server  are  implemented  as  pure  Java  programs.  We 
have  used  the  multithreading  capability  of  the  Java  Virtual 
Machine  to  exploit  the  intrinsic  parallelism  of  the  different 
tasks  the  two  entities  have  to  perform,  and  we  have  built 
the  framework  as  a  set  of  Java  packages. 

Control messages and DVM status changes not related
to the discovery-and-join protocol or the recover-from-failure
protocol are exchanged through a star-shaped set
of reliable unicast channels whose center is the DVM
status server. These connections are implemented through
the communication facilities delivered by the java.net
package. It is important to notice that neither the star
topology nor the use of the java.net package is a
constraint imposed on all the communication services in
the framework. On the contrary, user-level communication
services may adopt the connection topology that best suits
their needs and are not required to use the java.net package.
For this reason, neither
the star topology interconnecting the kernels and the DVM
server, nor the fact that the java.net package is used,
represents a major bottleneck in the Harness
metacomputing framework. The kernels and the DVM
server interact to guarantee a consistent evolution of the
status of the DVM, both when users request new
services to be added and when computational
resources or networks fail. This consistency is enforced
by means of a set of protocols executed during the
different phases of the DVM life.

A  DVM  may  be  started  in  three  different  ways: 

•  starting  a  DVM  server; 

•  starting  a  kernel; 

•  starting  an  application. 

In the first case, a user invokes the execution of the
main method of the Java H_Server class from the
edu.emory.mathcs.harness package, providing as a
parameter the name of the DVM this server is starting. The
DVM server reads the configuration file harness.defaults
to see if the user wants to use a server-based implementation
or a multicast implementation of the HARNESS name
space. In the former case the server gets from the same
configuration file the port and address of the HARNESS
name server the user wants to adopt and connects to it to
register its presence. In the latter case it executes a hashing
function to map the DVM name into a multicast IP address
and port. Then it starts to multicast I’m alive packets on
the channel and to listen for incoming packets.
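
The text does not specify the hashing function, so the following is only a sketch of one way a DVM name could be mapped deterministically to a multicast group and port. The class name DvmChannel, the use of CRC32, the choice of the administratively scoped 239.0.0.0/8 block, and the port range 1024 to 65535 are all assumptions made for illustration.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Hypothetical mapping from a DVM name to a multicast group and port. */
final class DvmChannel {
    final InetAddress group;
    final int port;

    private DvmChannel(InetAddress group, int port) {
        this.group = group;
        this.port = port;
    }

    /** Every host that hashes the same DVM name obtains the same channel. */
    static DvmChannel forName(String dvmName) {
        CRC32 crc = new CRC32();
        crc.update(dvmName.getBytes(StandardCharsets.UTF_8));
        long h = crc.getValue(); // unsigned 32-bit checksum held in a long

        // The low 24 bits select a group inside 239.0.0.0/8 (assumed range).
        byte[] address = {(byte) 239, (byte) (h >>> 16), (byte) (h >>> 8), (byte) h};

        // Mix the bits again and fold them into the non-privileged port range.
        int port = 1024 + (int) (((h >>> 24) ^ h) % (65536 - 1024));
        try {
            return new DvmChannel(InetAddress.getByAddress(address), port);
        } catch (UnknownHostException e) {
            // Cannot happen for a 4-byte address; rethrown for completeness.
            throw new IllegalStateException(e);
        }
    }
}
```

Because the mapping depends only on the name, a server, a kernel, and an application can each compute the channel independently without contacting a name server. Distinct DVM names may collide on the same channel, which is why packets would still need to carry the DVM name.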

In  name-server  mode  the  DVM  server  can  get  two  types 
of  packets: 

•  probe  requests  from  the  name-server; 

•  join  requests  from  kernels. 

Probe requests are sent by the naming service every time it
is requested to provide information about a DVM server.
Before sending its current data the name server validates
them with a probe message. If the server receives a join
packet then it opens a TCP connection to the sender
kernel and starts the Join protocol.

In  multicast  mode  the  DVM  server  can  get  three  types 
of  packets: 

•  I’m  alive  packets  from  a  DVM  server; 

•  join  packets  from  kernels; 

•  query  packets  from  applications. 

The server checks the source address of any I’m alive
packet it receives. If the packet comes from another server,
the server multicasts a train of I’m alive packets to notify
the other server of its presence and then exits. This
forces the kernels running on computational resources
enrolled in the DVM to start the server regeneration
protocol and to regenerate a new, single server. This
mechanism prevents the existence of multiple DVM
servers with partial or outdated information and guarantees
that a single DVM server is active in a DVM.

If the server receives a join packet then it opens a
TCP connection to the sender kernel and starts the Join
protocol.

If  the  server  receives  a  query  packet  then  it  checks  if  a 
kernel  exists  on  the  computational  resource  from  which 
the  application  is  querying.  If  a  kernel  is  already  active, 
then  the  server  provides  to  the  querying  application  the 
port  number  on  which  the  kernel  accepts  connections  from 
applications,  otherwise  it  provides  a  null  reply. 

The  second  way  to  start  a  Harness  DVM  is  to  invoke 
the  main  method  of  the  Main  class  in  the 
edu.emory.mathcs.harness  package  providing  as  a  startup 
parameter  the  name  of  the  DVM  the  kernel  wants  to  enroll 
into.  The  kernel  reads  the  configuration  file 
harness.defaults to check if the user wants to use a server-based
implementation or a multicast implementation of the
HARNESS name space. In the former case the kernel gets
from the same configuration file the port and address of
the HARNESS name server the user wants to adopt,
connects to it, and asks if there is a DVM server for the
DVM it wants to join. If there is one, it sends a join request
to it; if there is none, it starts the DVM regeneration
protocol.

In the latter case, the kernel executes the hash function
to map the DVM name into an IP multicast address and
port and sends a Join packet on that channel. The
kernel performs three tries before giving up. After three
tries have timed out without a DVM server activating a
TCP connection, the kernel assumes no DVM server exists
and spawns a new JVM to start a new DVM server. It
then resumes sending the Join packet.
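
The retry behavior of a kernel in multicast mode can be sketched as follows. The network and process-spawning steps are abstracted behind two hypothetical callbacks so that only the control flow is shown. The names JoinRetry, sendJoinAndAwaitConnection, and spawnDvmServer are illustrative, and the decision to give up after one spawn and a second round of tries is an assumption, since the text does not say what happens if the second round also fails.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical control flow of a kernel joining a DVM in multicast mode. */
final class JoinRetry {
    static final int TRIES_PER_ROUND = 3;

    /**
     * sendJoinAndAwaitConnection sends one Join packet and returns true if a
     * DVM server opened a TCP connection before the timeout expired.
     * spawnDvmServer starts a new DVM server (a new JVM in the paper).
     * Returns the number of Join packets sent before a server connected.
     */
    static int join(BooleanSupplier sendJoinAndAwaitConnection, Runnable spawnDvmServer) {
        int sent = 0;
        for (int round = 0; round < 2; round++) {
            for (int i = 0; i < TRIES_PER_ROUND; i++) {
                sent++;
                if (sendJoinAndAwaitConnection.getAsBoolean()) {
                    return sent;
                }
            }
            if (round == 0) {
                // Three tries timed out: assume no server exists and start one.
                spawnDvmServer.run();
            }
        }
        // Assumption: stop after the second round instead of spawning again.
        throw new IllegalStateException("no DVM server answered after spawning one");
    }
}
```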

The  third  way  to  start  a  Harness  DVM  is  to  instantiate 
the  class  H_core  or  H_RMIcore  from  the  package 
edu.emory.mathcs.harness  in  an  application  providing  the 
DVM name as a parameter. The H_RMIcore class
constructor  hashes  the  DVM  name  to  obtain  a  port  for  the 
HARNESS  RMI  registry.  The  HARNESS  RMI  registry 
provides  to  the  H_RMIcore  class  an  RMI  reference  for  the 
local  kernel.  If  it  cannot  connect  to  that  port,  the 
H_RMIcore  class  assumes  that  no  kernel  for  the  given 
DVM  is  active  on  the  local  host  and  starts  a  new  one. 

The  H_core  class  constructor  executes  the  hashing 
function  and  drops  a  query  packet  on  the  multicast 
channel.  If  no  answer  comes  back  or  if  the  answer  says 
that  no  kernel  is  active  on  the  computational  resource  the 
constructor  spawns  a  new  JVM  starting  a  kernel  and  sets  a 
flag  to  avoid  starting  a  new  one  even  in  the  case  of  another 
failed set of tries. The possibility of two or more
applications racing to spawn two or more kernels on the
same computational resource is prevented by the Join
protocol.

The DVM server initiates the Join protocol each time it
receives a multicast Join packet. The Join packet contains
the IP address and the port number on which the kernel
willing to join accepts a TCP connection. The first step
of the Join protocol is the establishment of a TCP
connection between the DVM server and the joining
kernel. Then the DVM server waits for the kernel to
provide its baseline. At this point the server performs two
checks: the baseline check and the uniqueness check. The
baseline check verifies the compatibility of the kernel with
the current implementation of the DVM server. The
uniqueness check verifies that no other kernel has already
joined from the same computational resource. If either
check fails, an error message is sent back, the protocol
terminates with a failure and the connection is closed. If
the kernel passes both checks, the DVM server determines
whether the kernel is joining back after a failure
(computational resource or network crash) or whether the
computational resource has never been enrolled in the
DVM before. If the computational resource is coming
back from a crash, the DVM server sends the kernel a
crash token message and a copy of its pre-crash status;
otherwise it sends a new token message. The next
step is to get from the kernel its current status and to send
back to it the current status of the DVM.

At this point the Join protocol is successfully
completed: the DVM server generates a Join event that is
distributed as described in the next section, and the kernel
is now enrolled in the DVM.
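
The server-side decisions of the Join protocol can be summarized in a small model. This is a sketch under stated assumptions, not the HARNESS implementation: the class JoinHandler, the Reply values and the representation of the baseline as a set of service names are hypothetical, and networking is left out entirely.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical server-side model of the Join checks described above. */
final class JoinHandler {
    enum Reply { ERROR_BASELINE, ERROR_DUPLICATE, NEW_TOKEN, CRASH_TOKEN }

    private final Set<String> baseline;                    // services every kernel must offer
    private final Set<String> enrolled = new HashSet<>();  // hosts currently in the DVM
    private final Map<String, String> crashedStatus = new HashMap<>(); // host -> pre-crash status

    JoinHandler(Set<String> baseline) {
        this.baseline = new HashSet<>(baseline);
    }

    /** Records that a host crashed, keeping its last known status. */
    void recordCrash(String host, String lastStatus) {
        enrolled.remove(host);
        crashedStatus.put(host, lastStatus);
    }

    /** Applies the baseline check, then the uniqueness check, then picks the token. */
    Reply handleJoin(String host, Set<String> kernelServices) {
        if (!kernelServices.containsAll(baseline)) return Reply.ERROR_BASELINE;
        if (enrolled.contains(host)) return Reply.ERROR_DUPLICATE;
        enrolled.add(host);
        // A host that crashed earlier is recognized and gets a crash token.
        return crashedStatus.remove(host) != null ? Reply.CRASH_TOKEN : Reply.NEW_TOKEN;
    }
}
```

In this model a second Join from an already enrolled host is rejected by the uniqueness check, which is how two applications racing to start kernels on the same resource end up with a single enrolled kernel.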


The Leave protocol is much simpler than the Join protocol.
The Leave protocol is always started by a kernel. A TCP
connection between the kernel and the DVM server is
guaranteed to be active, since it is not possible to start the
Leave protocol before a successful completion of the Join
protocol. The kernel sends an explicit Leave message to
the DVM server and then closes the TCP connection. The
DVM server generates a Leave event that is distributed to
all the remaining kernels.

The  status  of  the  DVM  consists  of  the  set  of 
computational  resources  currently  enrolled  in  the  DVM, 
the  set  of  services  available  on  each  enrolled 
computational  resource  as  well  as  the  DVM’s  baseline. 
We call the baseline of a DVM the minimum set of services
a computational resource must be able to deliver in order
to join the DVM. The dynamic nature of the framework
makes this state an evolving entity, thus the framework
keeps it up to date and available for queries from any
application or service in the DVM. It is important to notice
that information about the applications currently using
services or the internal status of an application is not part
of the DVM status and losing track of it does not in any way
compromise the existence of the DVM in itself. Any form
of  application  tracking  and  check-pointing,  while  highly 
desirable  for  many  applications,  is  a  service  in  itself  and 
the  framework  does  not  need  to  incorporate  it  in  its  status. 

The Harness metacomputing framework guarantees that
all the events that change the status of the DVM are
received by all the kernels enrolled in the DVM in the
same order. In the current implementation the Total Order
(TO)  protocol  is  implemented  adopting  the  DVM  server  as 
a  central  ordering  entity  and  exploiting  the  stream  nature 
of  TCP  connections  to  avoid  subsequent  losses  of  order. 
Although  very  simple,  a  centralized  implementation  of  the 
TO  protocol  has  in  general  two  negative  features: 

•  the  central  entity  is  a  single  point  of  failure; 

•  the  central  entity  is  a  bottleneck. 

However, these two problems do not represent a major
flaw in the design and efficiency of our framework. In
fact, the single point of failure is limited to the inability
of the framework to retrieve, after a DVM server crash, the
status of a previously crashed kernel, and the central
bottleneck does not influence application-level
communication services. The status of a DVM, as
defined in the Harness metacomputing framework, consists
of the sum of the statuses of all enrolled kernels. Each event
that changes the status of the DVM changes the status of a
kernel in a way that is recorded by the kernel itself, the
only exception being a kernel crash. Thus it
is not possible for an event, except for kernel crash events,
to get lost in a DVM server crash. On the contrary, in the
case of a DVM server crash it is possible to reconstruct
completely the current status of the DVM simply by
obtaining from every surviving kernel a copy of its current
status.

User Setup

    public int login(java.lang.String, java.lang.String)
    public int logout()
    public boolean setCResourceMapping(H_pname)
    public boolean setServiceMapping(H_pname)

DVM Manipulation

    public java.lang.String getName()
    public H_StringKeyedTable getArchs()
    public H_crname[] getAHosts()
    public java.util.Enumeration getHosts()
    public H_RetVal grabHost(H_crname[], H_QoS)
    public H_RetVal deleteHost(H_crname[], H_QoS)
    public void kill()

Plug-ins Manipulation

    public H_RetVal getInterfaceDescriptor(H_phandle)
    public H_RetVal load(H_pname, H_crname[], H_QoS)
    public H_RetVal unload(H_phandle[], H_QoS)

Information Gathering

    public H_Info getInfo()
    public java.util.LinkedList getAll(H_pname)
    public java.util.LinkedList getAll(H_pname, H_crname)
    public java.util.LinkedList getAll(java.lang.Class)
    public java.util.LinkedList getAll(java.lang.Class, H_crname)
    public H_phandle getAny(H_pname)
    public H_phandle getAny(H_pname, H_crname)
    public H_phandle getAny(java.lang.Class)
    public H_phandle getAny(java.lang.Class, H_crname)
    public java.util.LinkedList getPlugins(H_crname)

Figure 2 Functional interface provided by the Java class H_RMIcore.



It is important to notice that the inability of this
reconstruction process to keep track of crashed
kernels does not mean that applications relying on services
delivered by the crashed kernels have no choice but to
stop and fail. Reliable distributed check-pointing
of application status and restart of failing services are
services themselves, thus their behavior in the event of
kernel crashes is not constrained by the DVM status and
the reconstruction of the DVM status is not concerned
with them.

To evaluate the bottleneck represented by the star
topology, it is important to notice that it involves only
events requiring DVM status changes; traffic generated by
user applications exchanging data is not required to flow
through the DVM server. The only
events that the DVM status server needs to process are:

•  a  kernel  joining  the  DVM; 

•  a  kernel  leaving  the  DVM; 

•  a  kernel  crash; 


•  the  addition  of  a  service  to  the  DVM. 

Thus  the  DVM  server  represents  only  a  marginal 
bottleneck  in  the  Harness  metacomputing  framework. 
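
The centralized total-order scheme, restricted to the four event kinds listed above, can be sketched as a sequencer that stamps each event and forwards it to every kernel over a FIFO link. The names EventSequencer, Stamped and KernelLink are hypothetical; the KernelLink interface merely stands in for the TCP stream to an enrolled kernel.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of a central ordering entity for DVM status events. */
final class EventSequencer {
    enum Kind { JOIN, LEAVE, CRASH, SERVICE_ADDED }

    /** An event stamped with its position in the global order. */
    static final class Stamped {
        final long seq;
        final Kind kind;
        final String host;

        Stamped(long seq, Kind kind, String host) {
            this.seq = seq;
            this.kind = kind;
            this.host = host;
        }
    }

    /** Stands in for one TCP stream to an enrolled kernel (FIFO delivery). */
    interface KernelLink {
        void deliver(Stamped event);
    }

    private final List<KernelLink> links = new ArrayList<>();
    private long next = 0;

    synchronized void enroll(KernelLink link) {
        links.add(link);
    }

    /** Stamps the event and forwards it to every kernel in the same order. */
    synchronized Stamped publish(Kind kind, String host) {
        Stamped e = new Stamped(next++, kind, host);
        for (KernelLink link : links) {
            link.deliver(e);
        }
        return e;
    }
}
```

Since publish is serialized and each link preserves order, every kernel observes the same sequence of status events, while application data never passes through this object.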

The  current  release  of  HARNESS  provides  4 
mechanisms  for  applications  to  interact  with  a  HARNESS 
kernel: 

•  the  H_RMIcore  Java  class  that  provides  a  set  of 
fully  object  oriented  methods  and  communicates 
with  the  kernel  by  means  of  RMI; 

•  the  H_core  Java  class  that  provides  the  same 
functional  interface  as  the  H_RMIcore  class  on  top 
of  a  string  oriented  protocol; 

•  a  C  library  that  exploits  the  JNI  to  invoke  the 
methods  of  the  above  mentioned  Java  class; 

•  a  language  independent,  string  oriented  protocol  on 
top  of  a  TCP  reliable  connection. 

In figure 2 you can see the functional interface provided
by the Java class H_RMIcore. The functions can be
divided into four groups: user setup, DVM manipulation,

H_USER
    username
    password
    H_ISROOTLIKE    <present if this user is equivalent to root, absent otherwise>
    H_LOADABLECLASSES
        <classname or packagename as in import statement in java files>
    H_ENDLOADABLECLASSES
    H_ACCESSIBLEPLUGINS
        <classname or packagename as in import statement in java files>
    H_ENDACCESSIBLEPLUGINS
    H_REPOSITORIES
        <repository URL>
    H_ENDREPOSITORIES
H_ENDUSER

Figure 3 Syntax of the harness.policy file.



plug-ins manipulation and information gathering. A user
needs to log into the DVM before performing any
other operation. The DVM stores the user name and
password pair so that the same user is recognized at log-in
regardless of the kernel he is logging in from. A user
can set up a resource mapper service and a service mapper
service. The resource mapper service performs a
translation from a user-defined naming scheme into the
HARNESS naming scheme for the names of the
computational resources. The service mapper service
performs a translation from a user-defined naming scheme
into the HARNESS naming scheme for the names of the
plug-ins. Thus it is possible to build user-defined naming
schemes on top of the basic HARNESS naming scheme
both for computational resources and for plug-ins. The
only constraint is the need for a complete mapping from
the user-defined scheme to the HARNESS scheme. As an
example, we developed a simple resource mapper that is
able to translate architecture names into instances of that
architecture available at Emory. Once logged in, a user can
access all the other functionality provided to grab and
remove hosts from the DVM, load and unload plug-ins
and query the status of the DVM. The capability to request
a service (e.g. deleting a host) does not imply that the
system will fulfill the request; in fact, every HARNESS
kernel is configured at bootstrap with security options.
These options define:

•  which  user  is  the  root  user  for  the  local  kernel; 

•  if  root  user  access  is  required  to  force  the  kernel  to 
leave  a  DVM; 

•  which  plug-ins  each  user  can  load; 

•  which  plug-ins  each  user  can  access; 

•  which  repositories  the  kernel  can  retrieve  plug-ins 
from. 

It  is  important  to  notice  that  each  kernel  can  set  a  different 
set  of  users  who  have  root  access  to  it.  Thus,  it  is  possible 
both  to  have  a  global  system  administrator  for  clusters 
owned  by  a  single  entity  and  to  have  a  different 
administrator  for  each  node  of  a  DVM  composed  of 
personal  workstations. 

The sets of plug-ins loadable and accessible for each user
are defined through a security configuration file, namely
the harness.policy file. In figure 3 you can see the syntax
of the harness.policy file, while in figure 4 you can see an
example instance of it. If a user is not explicitly listed in
the policy file, he is associated by default with the user
NOBODY. Thus it is possible to establish a minimum
access level for anonymous users by means of the user ID
NOBODY. A plug-in can be unloaded only by the user
who loaded it or by the local root user. Thus it is not

H_USER
    MAURO
    Harness
    H_ISROOTLIKE
    H_LOADABLECLASSES
        edu.emory.mathcs.harness.*
        helloHarness.*
        cgrowth4.*
    H_ENDLOADABLECLASSES
    H_ACCESSIBLEPLUGINS
        edu.emory.mathcs.harness.*
        helloHarness.*
        cgrowth4.*
    H_ENDACCESSIBLEPLUGINS
    H_REPOSITORIES
        http://www.mathcs.emory.edu/harness/REPOSITORY/
        http://www.mathcs.emory.edu/~onVREPOSrTORY/
    H_ENDREPOSITORIES
H_ENDUSER

H_USER
    NOBODY
    NOBODY
    H_LOADABLECLASSES
        edu.emory.mathcs.harness.*
        helloHarness.*
    H_ENDLOADABLECLASSES
    H_ACCESSIBLEPLUGINS
        edu.emory.mathcs.harness.*
        helloHarness.*
        cgrowth4.*
    H_ENDACCESSIBLEPLUGINS
    H_REPOSITORIES
        http://www.mathcs.emory.edu/harness/REPOSITORY/
    H_ENDREPOSITORIES
H_ENDUSER

Figure 4 An example harness.policy file.



possible  for  a  non-root  user  to  remove  a  plug-in  that  is  part 
of  another  user’s  application. 
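
To make the policy layout of figures 3 and 4 concrete, the following sketch reads such a file and answers whether a user may load a given class, falling back to NOBODY for users that are not listed. It is an illustration only: the class HarnessPolicy and its methods are hypothetical, and the pattern matching (a trailing ".*" matches classes directly inside that package, as an import statement would) is an assumption about how the entries are interpreted.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical reader for the harness.policy layout shown in figures 3 and 4. */
final class HarnessPolicy {
    static final class User {
        String name;
        String password;
        boolean rootLike;
        final List<String> loadable = new ArrayList<>();
        final List<String> accessible = new ArrayList<>();
        final List<String> repositories = new ArrayList<>();
    }

    private final Map<String, User> users = new HashMap<>();

    static HarnessPolicy parse(String text) {
        HarnessPolicy policy = new HarnessPolicy();
        User cur = null;
        List<String> section = null; // list currently being filled, if any
        int field = 0;               // 0 = expect name, 1 = expect password, 2 = body
        for (String raw : text.split("\\R")) {
            String line = raw.trim();
            if (line.isEmpty()) continue;
            if (line.equals("H_USER")) {
                cur = new User(); field = 0; section = null;
            } else if (cur == null) {
                throw new IllegalArgumentException("text outside H_USER: " + line);
            } else if (line.equals("H_ENDUSER")) {
                policy.users.put(cur.name, cur); cur = null;
            } else if (field == 0) {
                cur.name = line; field = 1;
            } else if (field == 1) {
                cur.password = line; field = 2;
            } else if (line.equals("H_ISROOTLIKE")) {
                cur.rootLike = true;
            } else if (line.equals("H_LOADABLECLASSES")) {
                section = cur.loadable;
            } else if (line.equals("H_ACCESSIBLEPLUGINS")) {
                section = cur.accessible;
            } else if (line.equals("H_REPOSITORIES")) {
                section = cur.repositories;
            } else if (line.startsWith("H_END")) {
                section = null; // closes whichever list was open
            } else if (section != null) {
                section.add(line);
            } else {
                throw new IllegalArgumentException("unexpected line: " + line);
            }
        }
        return policy;
    }

    /** Users not listed in the file fall back to the NOBODY entry. */
    User lookup(String name) {
        User u = users.get(name);
        return u != null ? u : users.get("NOBODY");
    }

    /** True if className matches one of the user's loadable entries. */
    boolean canLoad(String userName, String className) {
        User u = lookup(userName);
        if (u == null) return false;
        for (String pat : u.loadable) {
            if (pat.endsWith(".*")) {
                String prefix = pat.substring(0, pat.length() - 1); // keeps the dot
                if (className.startsWith(prefix)
                        && className.indexOf('.', prefix.length()) < 0) return true;
            } else if (className.equals(pat)) {
                return true;
            }
        }
        return false;
    }
}
```

With the file of figure 4, a user that is not listed would be allowed to load classes from helloHarness but not from cgrowth4, because only the NOBODY entry applies to that user.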

3  The  Design  of  a  PVM  plug-in  for  a 
HARNESS  DVM 

The  design  of  our  PVM  compatibility  suite  had  three 
main  objectives: 


•  requiring no changes to PVM applications to run in
the new environment;

•  minimizing the number of changes to be made to the
application side PVM library;

•  achieving  a  modular  design  for  the  services  provided 
by  the  PVM  daemon. 

We  achieved  these  goals  by  designing  a  set  of  plug-ins 
able  to  understand  the  original  PVM  library  to  PVM 

[Figure: block diagram. Recoverable labels: RMI calls from other
daemons; PVMD start-up (checks that all necessary generic plug-ins
are present and loads them if necessary); listener and accepter
threads serving the task sockets; PVM information manager (local
host table; global host table, only on the master, merging the local
host tables; distributed election to regenerate the master);
user-level message-passing service; PVM TID management unit;
PVM-TID based routing unit.]

Figure 5 Architecture of the HARNESS PVM Plug-in


daemon  protocol  and  to  duplicate  the  services  provided  by 
the  original  PVM  daemon.  This  approach  provides 
complete  compatibility  with  PVM  legacy  code,  both  in  C 
and  in  FORTRAN,  and  requires  only  one  change  in  the 
PVM  library  on  the  application  side,  namely  the  adoption 
of  internet  domain  sockets  for  the  communication  channel 
between  the  library  and  the  daemon. 

Thus  to  run  a  legacy  PVM  application  in  the  Harness 
PVM  environment  it  is  only  necessary  to  link  the  original 
object  code  with  the  modified  version  of  the  library. 

In our implementation, the services provided by the
PVM daemons to applications are delivered by dedicated
modules and general purpose Harness plug-ins such as a
process-spawning plug-in and a message-passing plug-in
(see figure 5). In figure 6 we show the actual sequence of
the events in the Harness PVM startup, while figure 7
shows the chain of events serving an add_host request. At
PVM startup a special PVM application (i.e. the
HARNESS-PVM console) starts up the PVM daemon by
issuing to the Harness kernel the command to load the
main PVMD plug-in. This plug-in takes care of requesting
the Harness kernel to load the services that are required to
provide full PVM compatibility. When a task requests an
add_host operation, the local PVMD plug-in translates it
into a request for the remote Harness kernel to load the
main PVMD plug-in, which then takes care of requesting
the loading of the other needed plug-ins.

The  current  version  of  the  plug-in  provides  only  the 
services  required  to  emulate  a  completely  functional 
subset  of  PVM  daemon’s  capabilities.  This  subset 
includes: 

•  all  the  process  control  PVM  commands; 

•  the  pvm_parent,  pvm_tidtohost  and  pvm_error 
information  commands; 

•  all  the  message  buffers  commands; 

•  all  the  point-to-point  sending  commands; 

•  all  the  receive  commands. 

Multicast and group operations are currently not
supported; however, the development of these services as
additional pluggable modules is in progress.

Direct routing is supported as it completely bypasses
the PVM daemons and is completely implemented in the
original PVM library.

The  modularity  of  the  design  will  easily  let  us 
substitute  any  plug-in  with  new  versions  in  order  to 
provide  an  enhanced  version  of  the  service.  As  an 
example,  it  will  be  easy  to  load  a  new  version  of  the 
database  plug-in  to  provide  an  extended  system  querying 
capability. Besides, the Harness capability to hot swap
services allows run-time tuning of services to the set of
hosts enrolled in the virtual machines, e.g. a specific
version of the message-passing plug-in can be loaded at
run-time if a new communication fabric becomes available.

This design also has the following additional
advantages. The first one derives from the fact that the
message-passing service provided by the message-passing
plug-in needs only to peek at the destination field of a
message in order to route it and does not need to know
anything about the actual content of the message. This is
beneficial for two reasons:

•  the marshalling and un-marshalling of the data types is
performed inside the Harness PVM library in C or
Fortran, thus we do not incur the typical marshalling
inefficiency due to the strong typing of Java;

•  it  is  extremely  easy  to  substitute  the  message  passing 
plug-in  with  another  plug-in  optimized  to  a  specific 
communication  fabric  (be  it  a  proprietary  local 
network  or  an  unreliable  Internet  connection)  because 
they  only  need  to  move  arrays  of  bytes. 
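
The idea of routing on the destination field alone can be sketched as below. The wire layout (a 4-byte big-endian destination TID followed by an opaque payload) and the names ByteRouter, register and route are assumptions made for this example; the actual PVM message header is not described in this paper.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical router that reads only the destination field of each message. */
final class ByteRouter {
    // Assumed layout: 4-byte big-endian destination TID, then an opaque payload.
    static final int HEADER_BYTES = 4;

    private final Map<Integer, Consumer<byte[]>> routes = new HashMap<>();

    /** Associates a destination TID with the channel that reaches it. */
    void register(int tid, Consumer<byte[]> sink) {
        routes.put(tid, sink);
    }

    /** Peeks at the destination and forwards the whole message untouched. */
    boolean route(byte[] message) {
        if (message.length < HEADER_BYTES) return false;
        int dest = ByteBuffer.wrap(message, 0, HEADER_BYTES).getInt();
        Consumer<byte[]> sink = routes.get(dest);
        if (sink == null) return false; // unknown destination
        sink.accept(message);           // payload is never decoded here
        return true;
    }
}
```

Because the payload is treated as a plain byte array, a replacement plug-in for a different communication fabric would only need to supply a different sink for each destination.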

Another benefit of our design is the fact that PVM
applications can rely on the Harness capability to soft-
install applications to move executables and libraries to the
hosts in the VM. Thus it is not necessary to install the
application and PVM itself on all the hosts of the VM: the
Harness loader will do it as long as they are available on
any host in the Harness DVM or on any one of the listed
repositories.

A  third,  very  important  benefit  of  our  design  is  the 
removal  of  the  single  point  of  failure  represented  by  the 
master  PVM  daemon.  In  fact,  providing  PVM 
compatibility  on  top  of  the  Harness  system  by  means  of  a 
set  of  cooperating  plug-ins,  allowed  us  to  exploit  the 
Harness  event  subscription/notification  service  to 
implement  a  distributed  control  algorithm  in  the 
information  management  plug-ins.  This  algorithm  is 
capable  of  reconstructing  a  consistent,  up-to-date  version 
of  the  PVM  status  after  the  crash  of  any  daemon. 

The performance delivered by the HARNESS PVM
plug-in is still not on par with the original PVM. In
particular, because the PVM plug-in shares the
JVM with any other plug-in on the same host, the
performance degradation is extremely sensitive to the
other activities currently ongoing in the HARNESS
DVM. To cope with this problem we are currently
studying a mechanism to force HARNESS to dedicate a
complete JVM to the components composing the PVM
daemon and applications. It is our opinion that such a
mechanism will remove the dependency of the
HARNESS PVM plug-in performance on external
applications without compromising the modularity and
expandability achieved so far.

4  The  PVM-Proxy  plug-in 

The PVM-Proxy plug-in implements a generic
translation of PVM user messages into function calls
according to a user-defined protocol that can be plugged
into the Proxy itself as a behavioral object. This arbitrary
protocol takes care of interpreting the messages coming
from the PVM side in order to generate the correct actions.
The simplest possible protocol contains only data packets
that need to be processed by the distributed object
application connected to the PVM-Proxy. The restrictions
that we need to impose on a PVM task in order to be
able to hot swap it are:

•  the  original  task  needs  to  be  able  to  checkpoint  itself; 

•  the  original  task  must  not  be  currently  using  the  direct- 
routing  capability  of  the  PVM  system. 

The  first  restriction  is  a  direct  consequence  of  the  fact 
that  the  original  task  will  be  substituted  and  there  is  no 
way  to  remove  it.  However,  this  restriction  applies  only  if 
the  PVM  task  is  being  swapped  out  during  the  execution. 
In  the  next  section  we  will  show  how  it  is  possible  to 
remove  it  in  the  case  in  which  a  component  of  a  PVM 
application  is  substituted  at  start-up  time. 

The second restriction, on the contrary, depends on our
implementation of the PVM compatibility suite; in fact, it
derives from the fact that we left the original PVM library
untouched and unaware of the changes that we introduced
in the daemons. However, we plan to remove this
restriction in future versions of the PVM compatibility
suite.
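
The pluggable protocol described at the beginning of this section can be pictured as an interface that the proxy delegates to. The sketch below shows the simplest case, in which every message is a data packet. All names (ProxySketch, Protocol, DataSink, DataOnlyProtocol) are hypothetical, and the payload encoding as a sequence of big-endian doubles is an assumption chosen only to keep the example small.

```java
import java.nio.ByteBuffer;

/** Hypothetical shape of a proxy with a pluggable protocol object. */
final class ProxySketch {
    /** The behavioral object: interprets one raw PVM user message. */
    interface Protocol {
        void onMessage(int senderTid, byte[] body);
    }

    /** Stand-in for the distributed object application behind the proxy. */
    interface DataSink {
        void process(double[] samples);
    }

    /** Simplest protocol: every message is a data packet of big-endian doubles. */
    static final class DataOnlyProtocol implements Protocol {
        private final DataSink sink;

        DataOnlyProtocol(DataSink sink) {
            this.sink = sink;
        }

        @Override
        public void onMessage(int senderTid, byte[] body) {
            ByteBuffer buf = ByteBuffer.wrap(body);
            double[] samples = new double[body.length / Double.BYTES];
            for (int i = 0; i < samples.length; i++) {
                samples[i] = buf.getDouble();
            }
            sink.process(samples); // the message becomes a plain method call
        }
    }

    private Protocol protocol;

    ProxySketch(Protocol protocol) {
        this.protocol = protocol;
    }

    /** The protocol can be replaced without touching the proxy itself. */
    void plug(Protocol p) {
        this.protocol = p;
    }

    void receive(int senderTid, byte[] body) {
        protocol.onMessage(senderTid, body);
    }
}
```

A richer protocol would inspect the message to decide which method to invoke; the proxy code itself would stay the same.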

To  execute  the  actual  run-time  substitution  it  is 
necessary  to  perform  the  following  steps: 

1.  notify  the  PVM  plug-in  that  all  the  traffic  toward  the 
given  task  has  to  be  held; 

2.  checkpoint  the  PVM  task; 

3.  load  in  HARNESS  the  PVM  proxy  plug-in; 

4.  kill  the  original  PVM  task; 

5.  give  the  saved  status  to  the  proxy  plug-in; 

6.  tag the PVM task TID in the PVM plug-in with the
proxy attribute and store the handle of the actual
plug-in;

7.  invalidate  the  possibly  cached  references  to  the 
swapped  out  task  in  all  the  components  of  the  PVM 
plug-in; 

8.  remove  the  hold  on  the  traffic  toward  the  task. 

The  execution  of  the  PVM  application  will  continue 
undisturbed  with  the  exception  that  every  other  PVM  task 
will  experience  a  temporary  lag  in  the  responsiveness  of 
the  swapped  task  while  HARNESS  performs  the  above 
described  steps. 
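
The run-time substitution sequence can be modelled with a route object that queues traffic while the swap is in progress. This is a simplified sketch: HeldRoute and its methods are hypothetical names, the checkpoint, kill and restore operations are passed in as callbacks, and step 7 (invalidating cached references) has no counterpart because this model has a single routing object.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Hypothetical model of swapping a task for a proxy while its traffic is held. */
final class HeldRoute {
    private final Deque<byte[]> held = new ArrayDeque<>();
    private Consumer<byte[]> target; // current receiver for the task's TID
    private boolean holding;
    private boolean proxied;

    HeldRoute(Consumer<byte[]> originalTask) {
        this.target = originalTask;
    }

    /** Delivers a message, or queues it while a swap is in progress. */
    synchronized void send(byte[] msg) {
        if (holding) held.add(msg); else target.accept(msg);
    }

    synchronized void hold() {
        holding = true;
    }

    /** Tags the TID as proxied and stores the handle of the proxy. */
    synchronized void installProxy(Consumer<byte[]> proxy) {
        target = proxy;
        proxied = true;
    }

    /** Lifts the hold and flushes the queued messages in arrival order. */
    synchronized void release() {
        holding = false;
        while (!held.isEmpty()) target.accept(held.poll());
    }

    synchronized boolean isProxied() {
        return proxied;
    }

    /** Runs the steps in the order given in the text. */
    static void swap(HeldRoute route, Supplier<byte[]> checkpoint, Runnable kill,
                     Consumer<byte[]> restore, Consumer<byte[]> proxy) {
        route.hold();                    // 1. hold traffic toward the task
        byte[] state = checkpoint.get(); // 2. checkpoint the task
        // 3. loading the proxy plug-in is represented by the proxy argument
        kill.run();                      // 4. kill the original task
        restore.accept(state);           // 5. give the saved status to the proxy
        route.installProxy(proxy);       // 6. tag the TID and store the handle
        route.release();                 // 8. remove the hold
    }
}
```

Messages sent during the swap are not lost; they reach the proxy once the hold is lifted, which corresponds to the temporary lag the other tasks observe.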

As  soon  as  the  proxy  plug-in  is  in  place  it  can  start 
acting  as  a  bridge  between  the  legacy  application  and  any 
Harness  service  such  as  the  HARNESS  reusable 
simulation.  This  capability  allows  extending  legacy  PVM 
simulation  with  distributed  components  technology  and 
can  be  used  to  evolve  a  long-running  application  (e.g.  a 
climate  simulation)  according  to  the  results  obtained. 


The PVM proxy can also be used to substitute obsolete
components of a distributed application at start-up time.
This process requires the application to be capable of
pausing between the set-up phase and the actual execution.
However, it removes the requirement for the task to be
substituted to be able to checkpoint its status; in fact, the
actual substitution takes place before the tasks are
initialized with the zero state. The actual sequence of steps
for a start-up time task substitution is as follows:

1.  notify  the  PVM  plug-in  that  all  the  traffic  toward  the 
given  task  has  to  be  held; 

2.  load  in  HARNESS  the  PVM  proxy  plug-in; 

3.  kill  the  original  PVM  task; 

4.  tag the PVM task TID in the PVM plug-in with the
proxy attribute and store the handle of the actual
plug-in;

5.  remove  the  hold  on  the  traffic  toward  the  task. 

The  PVM-proxy  plug-in  automatically  redirects  all  the 
messages  sent  to  the  original  task  to  the  proxied  task.  Thus 
the  new  implementation  begins  execution  directly  from  the 
zero  state  and  there  is  no  need  for  the  original  task  to  be 
able  to  checkpoint  itself. 

5  An  Example  Use:  Removing  the  Obsolete 
Components  in  a  Distributed  MPEG  Coder 

In  this  section  we  will  describe  how  we  have  used  the 
PVM-proxy  plug-in  to  substitute  the  obsolete  components 
of  a  legacy  application,  namely  a  distributed  MPEG-1 
coder  [10]  targeted  at  the  heterogeneous  parallel  SM-IMP 
testbed architecture [11].

Figure 9 Macro-block architecture of the HARNESS coder based on the PVM-proxy plug-in.


MPEG-1  is  an  ISO  standard  for  motion  picture 
compression  [12].  The  algorithm  defined  in  the  standard 
requires  the  execution  of  several  steps  most  of  which  are 
very  computationally  intensive.  Thus  it  is  very  well  suited 
to  a  distributed  pipelined  implementation.  The  SM-IMP 
project  developed  a  prototype  distributed  MPEG  coder  for 
its  heterogeneous  SIMD-MIMD  parallel  architecture.  This 
coder divided the MPEG algorithm into a sequence of
pipelined  steps.  Each  of  these  steps  was  performed  by  the 
component  of  the  parallel  architecture  whose 
computational  paradigm  was  best  suited  to  the  kind  of 
parallelism  of  the  computational  step  itself.  The  different 
activities  were  glued  together  using  the  PVM  message 
passing  service.  The  main  parallelizable  steps  identified  by 
the  SM-IMP  coder  were: 

•  motion  estimation; 

•  discrete  cosine  transform. 

The  former  step  was  performed  on  a  MASPAR  MP1 
SIMD  array  processor  [13],  while  the  latter  step  was 
performed  by  a  MIMD  transputer  based  multiprocessor. 
The  remaining  interconnecting  steps  were  implemented  as 
sequential  PVM  tasks. 

To  prove  the  feasibility  of  PVM  tasks  substitution  by 
means  of  the  PVM-proxy  plug-in: 

1.  we  substituted  the  obsolete  parallelized  components 
of  the  coder  with  a  new  distributed  farmer/workers 
implementation  based  on  the  HARNESS  reusable 
simulation framework [14];

2.  we  connected  the  components  by  means  of  the  PVM- 
proxy  plug-in. 

In figure 8 you can see the macro-block architecture of the
original SM-IMP coder and in figure 9 the macro-block
architecture of the new one.
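
The farmer/workers pattern mentioned above can be sketched with a thread pool: the farmer hands independent work items (for instance, blocks of a frame) to a fixed set of workers and collects the results in order. This is a generic illustration using java.util.concurrent, not the HARNESS reusable simulation framework; the class Farmer and its farm method are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical farmer that hands independent work items to a pool of workers. */
final class Farmer {
    /** Applies work to every item in parallel and returns results in input order. */
    static <T, R> List<R> farm(List<T> items, Function<T, R> work, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<R>> pending = new ArrayList<>();
            for (T item : items) {
                Callable<R> job = () -> work.apply(item);
                pending.add(pool.submit(job));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> f : pending) {
                try {
                    results.add(f.get()); // waiting in submission order keeps item order
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException(e.getCause());
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

In a coder, the work function would be a step such as motion estimation or the discrete cosine transform applied to one block; the ordering guarantee keeps the downstream pipeline stages unaware of the parallelism.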

The presence of the PVM-proxy plug-in introduces a
non-negligible overhead. However, in our example the original
PVM task has been substituted by the farmer plug-in.
This plug-in shares a JVM with both the PVM plug-in and
the PVM-proxy plug-in. Thus, the data flowing through
the PVM daemon toward the proxied task do not
need to perform another hop through the TCP stack to
reach it. This fact largely compensates for the overhead
introduced by the PVM-proxy plug-in.

6  Concluding  remarks 

In the field of metacomputing, features and capabilities
are, by definition, subject to constant change. One possible
approach to insulating applications from this
aspect of platform evolution is to employ a model that is
extremely abstract. However, this approach usually leads
to very inefficient systems. Our alternative strategy is to
build flexibility into the metacomputing framework itself, by
permitting software-based reconfiguration in response to
both new technological developments and application
program requirements. As an example of this flexibility, in
this paper we have described the PVM-Proxy plug-in. This
plug-in leverages the HARNESS capability to reconfigure
the services and programming environments provided by a
Distributed Virtual Machine to transparently connect
legacy PVM applications to other Harness applications.
This capability allows substituting obsolete components of
a legacy application without requiring any change in the
remaining components of the application.

To  prove  our  claim  we  have  adopted  as  an  example  a 
PVM  legacy  application,  namely  an  MPEG-1  coder,  that 
was  targeted  to  an  obsolete  heterogeneous  parallel 
architecture.  We  successfully  substituted  the  two  main 
components  of  the  PVM  legacy  application  with  newly 
developed  modules  based  on  the  farmer/workers  paradigm 
and  Java  distributed  object  technology. 

We believe that this methodology endows applications
with a great deal of flexibility and the capability to adapt
to changing needs in terms of both evolving hardware and
software. Besides, our experiments with the
PVM-proxy plug-in show that the overhead introduced by
this degree of flexibility does not significantly
compromise performance. Nevertheless, future work will
aim at further reducing this overhead.

Abstract

We  have  developed  a  software  tool  called  MoBiDiCK 
ultimately  intended  for  distributed  computing.  In  this 
report  we  detail  the  design  and  show  results  using  the 
core  components  of  MoBiDiCK  running  two  different 
clients  on  a  local  cluster.  MoBiDiCK  is  a  database 
driven  system  that  can  be  used  to  marshal  a  large  number 
of  processors  across  the  Internet  in  order  to  have  them 
collaborate  on  a  single  computation.  These  utilize  a 
message-passing  API  and  control  synchronization 
formalism  we  have  developed  that  uses  the  HTTP 
standard  and  Web  servers.  CGI  programs  on  the 
volunteer  processors  perform  the  computations.  The 
problem  domains  best  served  by  MoBiDiCK  are  parallel 
computing  problems  that  are  CPU-bound  (not  I/O- 
bound),  and  require  minimal  inter-process 
communication.  The  parallel  tasks  that  we  present 
include  analysis  of  databases  of  three  dimensional  protein 
structures  and  Monte-Carlo  simulations  for  ab-initio 
protein  folding. 


I.  Introduction 

We  are  principally  interested  in  the  protein  folding 
problem  and  our  motivation  to  build  a  distributed 
computing  system  arises  from  our  fundamental  desire  to 
engineer  software  systems  that  have  the  computational 
capacity  to  tackle  the  high-dimensional  problem  of 
protein  folding.  We  are  not  alone  in  the  pursuit  of 
computational resources, as IBM Research has recently
announced a project under their “Deep Computing”
division to build a massive new computer specifically for
the protein folding problem, one that will achieve petaflop
performance in five years[1].

We  have  been  designing  optimized  software  for 
protein  folding  for  some  time  and  we  have  recently 
published a report of the first fast all-atom method for
generating plausible protein structures in real space[2],
and demonstrated that the program, FOLDTRAJ, has
O(N log N) time complexity in protein length.

FOLDTRAJ  embodies  over  10  years  of  software 
design  and  development  work.  FOLDTRAJ  is  an 
application  developed  using  the  National  Center  for 
Biotechnology Information (NCBI) software development
toolkit[3]. The NCBI SDK comprises source code used in
many  bioinformatics  applications  such  as  Entrez,  an 
integrated  bioinformatics  database;  BLAST,  a  tool  for 
searching  DNA  and  protein  sequence  databases;  Cn3D,  a 
tool  for  three-dimensional  molecular  structure 
visualization;  Sequin,  an  annotation  tool  for  sequence 
databases,  and  a  growing  number  of  Web  based 
applications  including  the  PubMed  system,  one  of  the  top 
scientific sites on the Internet. Within the NCBI toolkit
lie rich source code resources including a suite of tools
written in C for ASN.1 data specification, encoding,
decoding  and  code  generation,  a  comprehensive  HTTP 
protocol  interface,  as  well  as  our  own  work  on  a 
comprehensive  programming  interface  for  3-D  structure 
applications  known  as  MMDB-API[4]. 

To  tackle  the  protein  folding  problem,  we  wish  to  use 
FOLDTRAJ  on  thousands,  possibly  tens  of  thousands  of 
computers.  Distributed  computing  has  already  been 
explored  for  other  large  problems.  Cryptographers 
studying  brute-force  methods  to  crack  encryption  schemes 
have  already  laid  much  of  the  groundwork  for  distributed 
computing[5].  The  SETI@home  project  (Search  for 
Extra-Terrestrial  Intelligence)  has  demonstrated  that  over 
a  million  processors  across  the  Internet  can  be  brought  to 
bear  on  a  very  large  problem  and  that  people  are  eager  to 
volunteer  spare  CPU  cycles  for  such  causes[6]. 
FOLDTRAJ  has  been  carefully  implemented  so  that  it 
runs  on  a  large  number  of  computing  platforms  including 
NT  and  several  UNIX  variants,  making  it  an  ideal  client 
for  distributed  computing. 
The potential of the Internet as an infrastructure for
distributed  computing  has  been  predicted  to  reach  exaop 
performance  in  the  year  2007,  perhaps  exceeding  the 
performance  of  massively  parallel  supercomputers  at  that 
time  by  up  to  three  orders  of  magnitude[7].  Current 
efforts have already demonstrated that reasonable coarse-grained
parallelism can be achieved using processors on
the Web.

MoBiDiCK  is  the  Modular  Big  Distributed  Computing 
Kernel.  Our  goal  in  designing  MoBiDiCK  was  to  allow  us 
to  marshal  our  own  in-house  computing  resources  (Sun 
and  SGI  servers,  workstations,  a  Beowulf  cluster, 
secretarial  and  laboratory  computers).  The  system’s 
design,  however,  allows  any  computer  with  an  Internet 
connection  and  the  ability  to  run  HTTP  server  software  to 
participate  in  a  distributed  computation. 

The  development  of  applications  for  heterogeneous 
computing  environments  can  be  undertaken  in  a  variety  of 
ways.  Some  methods  require  the  adoption  of  a  particular 
programming model, as in Java and MPI, that is coupled
with  a  specific  computing  environment,  like  the  Internet 
or  a  local  network  of  workstations.  Other  approaches  offer 
underlying  communication  and  management 
infrastructures  that  can  integrate  various  existing  software 
models,  as  exemplified  by  the  Globus  project[8].  The 
MoBiDiCK  effort  fits  best  in  the  middle  of  this  spectrum. 
MoBiDiCK is neither an infrastructure nor a
programming model; rather, it is the middleware that
connects the two. The communication technology used by
MoBiDiCK  is  far  from  novel:  TCP/IP,  HTTP  and  CGI  are 
long  standing  Internet  standards,  and  the  idea  of  using 
Internet-connected  hosts  for  distributed  computing  dates 
back  several  years. 

Many  scientific  computations  are  essentially  iterative, 
that is, the same set of instructions is repeatedly applied
to  elements  in  the  problem  domain  space.  These  types  of 
programs  often  lend  themselves  well  to  a  distributed 
approach,  in  which  the  problem  domain  is  divided  among 
a  set  of  nodes  that  simultaneously,  and  more  or  less 
independently,  execute  the  same  program.  One  of  our 
goals  is  to  minimize  the  effort  required  to  introduce  this 
kind  of  parallelism  into  existing  code.  We  address  this  by 
providing  methods  to  enable  an  application  to  operate  as  a 
CGI  program  in  a  distributed  environment,  without 
hampering the application’s ability to be used in a
stand-alone environment.

Finally,  we  sought  to  build  a  database  driven  system. 
The  benefits  of  a  database  driven  system  include 
scalability  of  the  system  from  cluster  computing  to  wide- 
area  and  globally  distributed  computing,  as  well  as 
providing  records  of  past  performance  of  applications  that 
can  be  used  to  set  up  initial  load  balancing  and  improve 
the  overall  scheduling  of  distributed  computing  tasks. 


2.  Enabling  Technologies 
2.1.  Hypertext  Transfer  Protocol 

The  Hypertext  Transfer  Protocol  (HTTP)  was 
introduced  around  1990  to  address  the  need  for 
consistency  in  the  manner  in  which  computers  connected 
to  the  Internet  should  exchange  information,  and  has  since 
evolved  to  become  a  de  facto  standard  and  the  most 
widely  used  protocol  on  the  Internet.  HTTP  relies  on 
TCP/IP, in which IP (Internet Protocol) provides the
addressing and routing scheme for naming and reaching Internet
hosts, and TCP (Transmission Control Protocol) provides
error detection, error recovery, sequence control
and flow control mechanisms for data transmitted from
one host to another[9].

To  access  a  document  on  some  host,  the  user  sends  a 
request  through  a  client  program,  such  as  a  Web  browser, 
to an HTTP daemon running on the host. The daemon, or
Web  server,  processes  the  request  and  returns  output  back 
to  the  client.  Authentication  and  access  control  functions 
are  provided  by  the  Web  server  to  secure  private  data. 
Both  client  and  server  software  must  conform  to  the 
HTTP  specification  for  the  transmission  and  execution  of 
requests. 

The  motivation  for  HTTP  was  to  bring  together  and 
share  disparate  information  located  on  geographically 
distributed  machines.  A  shared  document  is  attributed  a 
Uniform  Resource  Identifier  (URI)  that  denotes  the 
document’s  location  in  the  network,  by  identifying  the 
computer’s  IP  address  and  the  file  path.  Files  can  be 
accessed  directly  with  the  URI  and  can  be  interconnected 
by  reference  links  embedded  in  the  document  using  the 
Hypertext  Markup  Language  (HTML).  Documents  whose 
content  is  not  plain-text  can  also  be  accessed.  The 
Multipurpose Internet Mail Extensions (MIME) standard is applied
to HTTP to identify a file format with a specific MIME
type and to inform the client about the type of data being
requested[9].

2.2.  Common  Gateway  Interface 

Along  with  the  ability  to  provide  access  to  static 
documents,  i.e.  those  represented  by  individual  files  on 
local  storage,  Web  servers  must  also  be  able  to  produce 
dynamic  content  which  depends  on  specific  user  input. 
For  example,  the  Web  server  must  be  able  to  accept  a 
search  key  from  the  user,  perform  a  database  query,  and 
return  the  search  results.  Such  capabilities  transcend  the 
scope  of  a  general-purpose  Web  server.  The  Common 
Gateway  Interface  (CGI)  was  established  to  address  the 
need  for  HTTP  software  to  produce  executable 
content[10]. When the client’s HTTP request refers to an
executable  file,  rather  than  a  static  one,  the  Web  server 
launches  the  application  as  a  new,  separate,  server-side 
process. Input parameters required by the CGI program
are  appended  to  the  request;  the  Web  server  simply  passes 
these  on  to  the  application,  often  by  way  of  environment 
variables.  When  the  application  completes,  the  server 
returns  its  output  back  to  the  client.  CGI  guidelines 
provide  an  API  for  the  manner  in  which  data  should  be 
exchanged between an application and an HTTP daemon.
Since  this  API  is  purely  syntactic,  a  CGI  executable  can 
be  programmed  using  nearly  any  language  or  platform. 

An  important  drawback  of  CGI  is  that  a  separate 
process  must  be  started  by  the  Web  server  for  each  client 
request.  Process  creation  and  initialization  overhead  can 
cause  a  significant  performance  bottleneck  if  multiple 
consecutive  requests  must  be  served.  A  CGI  program  also 
competes  for  system  resources  with  other  processes, 
including  the  HTTP  server  itself.  Furthermore,  HTTP  is  a 
stateless  protocol  that  does  not  directly  support  the  saving 
of  information  between  requests. 

The  FastCGI  open  protocol  addresses  these  drawbacks 
by  enabling  a  CGI  program  to  persist  across  multiple 
HTTP  requests,  thereby  reducing  process  creation  and 
initialization  overhead  and  allowing  state  information  to 
be maintained between requests[11]. The FastCGI
application  library  facilitates  new  application 
development  and  easy  migration  of  existing  CGI 
programs.  It  also  supports  a  distributed  configuration 
whereby  a  program  can  be  invoked  remotely  by  the  Web 
server  over  a  TCP/IP  connection.  As  a  future  direction, 
we  intend  to  use  the  FastCGI  application  library  for  the 
migration  and  further  development  of  MoBiDiCK 
modules. 

3.  System  Design 

3.1.  System  architecture 

MoBiDiCK  is  a  CGI-based  approach  to  distributed  and 
parallel  computing.  It  operates  on  a  set  of  nodes 
interconnected  by  a  TCP/IP  network.  A  node  is  simply  a 
networked  computer  that  can  act  as  a  Web  server,  i.e.  it 
should  be  able  to  support  the  correct  execution  of  HTTP 
server  software.  This  requires  that  it  be  assigned  a  static 
or  dynamic  IP  address.  Network  and  node  characteristics 
may  affect  performance,  but  there  are  no  specific 
hardware  or  operating  system  requirements  other  than 
those  imposed  by  HTTP  server  software.  A  node  can  be  a 
standard  workstation,  a  cluster  node,  or  a  multiprocessor. 
Local  resource  requirements  such  as  disk  space,  memory, 
network  and  I/O  bandwidth  are  commensurate  to  a 
particular  computation.  As  a  general  guideline, 
applications  are  designed  to  minimize  local  resource 
consumption. 


Figure  1.  System  architecture 


In  the  simplest  configuration,  a  kernel  server  is 
designated  such  that  it  can  reach  all  processing  nodes 
through  HTTP;  that  is,  Web  servers  running  on  the  nodes 
should  be  able  to  process  HTTP  requests  received  from 
the  kernel  server.  Five  kernel  modules  are  installed  on  the 
kernel  server:  Dispatcher,  Status,  Statekeeper,  Collector, 
and  DataManager.  These  are  distinct  CGI  applications 
that  carry  out  system  and  data  management  functions  such 
as  node  registration,  task  definition,  task  mapping,  job 
monitoring,  fault  tolerance,  load  balancing,  task 
migration,  output  collection  and  cleanup.  TaskApp 
modules,  which  are  also  CGI-driven  applications,  run  on 
the  nodes;  one  or  more  TaskApp  modules  may  be 
installed  on  a  given  node,  each  embodying  a  distinct 
computational  problem.  For  a  computation  to  be 
distributed  over  a  set  of  nodes,  a  corresponding  TaskApp 
must  be  installed  as  a  working  CGI  program  in  the  Web 
server’s  published  directory  tree.  Figure  1  illustrates  the 
general  system  architecture. 

Interactions  between  CGI  modules  involve  only  one  or 
two  consecutive  HTTP  requests.  For  example,  when  a 
computation  is  required  of  a  node,  the  Dispatcher  sends  a 
request  to  the  node’s  Web  server,  then  simply  terminates. 
Just before starting the subtask, the newly launched CGI
process  on  the  node  responds  by  sending  an 
acknowledgement  request  back  to  the  kernel  server.  This 
invokes  a  new  Dispatcher  CGI  process  which  records  the 
acknowledgement  and  immediately  exits.  Meanwhile,  the 
TaskApp  process  continues  with  the  computation. 

Data management is driven by a relational database
implemented  with  the  CodeBase  API  (Sequiter  Software, 
Alberta,  Canada,  www.sequiter.com),  an  efficient  library 
of  high-  and  low-level  data  management  functions  that, 
when  integrated  with  the  kernel  modules,  provide  a  fully- 
functional,  platform-independent  embedded  relational 
data  management  system.  Access  overhead  is  minimal 
since  all  database  functions  are  performed  through  API 
procedure  calls  from  within  the  application  source  code, 
and  updates  and  queries  do  not  require  explicit 
connections  to  an  external  database  engine[12].  The 
database is used to hold and manage all system data,
including  descriptive  and  statistical  information  pertaining 
to  nodes  and  computations,  input  parameter  values,  and 
output  files. 

The  kernel’s  modularity  allows  CPU,  I/O  and  network 
resources  to  be  optimized.  System  management  functions 
can  be  distributed  by  spreading  the  kernel  modules  across 
multiple  servers.  For  example,  a  distributed  kernel 
configuration  may  comprise  four  servers:  the  Dispatcher 
is  installed  on  one  server,  the  Collector  on  a  second, 
Status  and  Statekeeper  on  a  third,  and  the  DataManager 
on  a  fourth  server.  The  database  disk  is  remotely  mounted 
by  each  server  and  shared.  The  use  of  a  multiprocessing 
kernel  server  is  an  alternative  configuration  that  may 
benefit  CPU  bandwidth  as  kernel  modules  can  run 
simultaneously  on  several  processors.  I/O  contention 
during  output  collection  can  be  minimized  by  high- 
performance  storage  solutions  such  as  SCSI,  RAID,  and 
FibreChannel. 

3.2.  Communication  model 

MoBiDiCK  modules  communicate  by  embedding 
messages  within  HTTP  “GET”  and  “POST”  requests. 
Figure  2  illustrates  the  general  communication  model 
between a module m1 running on node N1 behind HTTP
server S1, and a module m2 running on node N2 behind
HTTP server S2. The steps involved in transmitting a
message from m1 to m2 are outlined below.

1. m1 opens a TCP connection, C, between N1 and N2

2. m1 sends an HTTP request containing message, M, for
m2 over C

3. S2 receives the request, starts m2, and passes the
request to m2

4. m2 extracts M from the request, processes M, and writes
a reply R to the stdout stream

5. S2 captures R, attaches an HTTP header to it, and
sends it over C

6. m1 receives the response from S2 and reads R

7. m1 closes C


Figure 2. Module communication. m1 and m2 are CGI modules
running on nodes N1 and N2, behind HTTP servers S1 and S2.


Modules m1 and m2 designate any two CGI programs
on any two nodes. For example, m1 can represent the
Dispatcher and m2 a TaskApp module on a compute node,
the message transmitted between them being a
computation request. Only HTTP server S2 on the
receiving node participates in the exchange, while S1
remains idle.
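
The seven steps above can be illustrated from the point of view of the message itself. The following Python sketch is an illustration only, not MoBiDiCK code: the host name, CGI path, field names (action, subtask) and helper function names are all assumptions made for the example. It shows how a message M might be embedded in a GET request by m1, recovered by m2 from the query string that a Web server hands to a CGI program, and answered with a reply R in CGI output form.

```python
from urllib.parse import urlencode, parse_qs


def build_get_request(host, cgi_path, message):
    """Steps 1-2: m1 serializes message M into the query string of a GET request."""
    query = urlencode(message)
    return f"GET {cgi_path}?{query} HTTP/1.0\r\nHost: {host}\r\n\r\n"


def extract_message(query_string):
    """Step 4: m2 recovers M from the query string passed on by the Web server."""
    return {key: values[0] for key, values in parse_qs(query_string).items()}


def cgi_reply(body):
    """Step 4: text a CGI program writes to stdout (header line, blank line, reply R)."""
    return "Content-Type: text/plain\r\n\r\n" + body


# Hypothetical host, path and field names, chosen only for this example.
request = build_get_request(
    "node2.example.org", "/cgi-bin/taskapp", {"action": "test", "subtask": "7"}
)

# The request line is "GET <path>?<query> HTTP/1.0"; a Web server would split
# off the query and expose it to the CGI program as QUERY_STRING.
query_string = request.split(" ")[1].split("?", 1)[1]
message = extract_message(query_string)
reply = cgi_reply("ack " + message["subtask"])
```

A POST request would carry the same encoded fields in the request body instead of the query string; the decoding step on the receiving side is otherwise the same.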

This  framework  does  not  preclude  other 
communication  models  within  a  TaskApp  module,  such 
as  MPI  message-passing  primitives  and  other 
programming  models  like  PVM  and  Linda,  as  well  as  new 
parallel  application  development  models. 

3.3.  Task  definition 

A  task  conceptually  represents  an  entire  computation. 

A  task’s  attributes  hold  various  characteristics  of  an 
associated  TaskApp,  such  as  application  filename,  node 
filesystem  path,  and  input  and  output  requirements.  A 
particular  execution  of  a  task  is  an  instance.  An  instance 
inherits  the  task’s  attributes  and  parameters,  and  has  its 
own  attributes  such  as  start  and  end  times,  total  execution 
time,  and  the  set  of  nodes  performing  the  instance. 
Runtime  options  such  as  output  collection  and  logging  are 
also  associated  with  an  instance. 

Once  instantiated,  the  task  is  partitioned  to  produce  a 
set  of  subtasks  that  are  mapped  to  the  set  of  nodes 
selected  to  perform  the  desired  computation.  A  subtask 
inherits  the  attributes  and  task  parameters  of  its  associated 
instance.  Each  subtask  represents  a  unique  TaskApp 
process.  Dispatching  a  task  consists  of  sending  subtask 
requests  to  the  nodes  and  launching  the  TaskApp 
processes.  While  a  TaskApp  is  running,  its  corresponding 
subtask is active; a complete subtask represents a process
that  executed  successfully;  a  subtask  is  incomplete  if  the 
process  did  not  finish  executing  due  to  user  intervention, 
node  failure,  or  communication  failure.  A  task  instance  is 
active  as  long  as  there  remain  nodes  executing  subtasks, 
and  complete  only  when  all  subtasks  have  been 
completed. 
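
The relation between subtask states and the state of a task instance can be written as a small rule. The sketch below is an assumption-laden illustration: the state names are taken from the text, but the function name and the choice to report an instance as "incomplete" when nothing is running and some subtask did not finish are ours, since the text only defines the active and complete cases for an instance.

```python
def instance_state(subtask_states):
    """Derive a task instance's state from the states of its subtasks.

    subtask_states: list of "active", "complete" or "incomplete".
    """
    # Complete only when every subtask has completed.
    if subtask_states and all(state == "complete" for state in subtask_states):
        return "complete"
    # Active as long as at least one node is still executing a subtask.
    if any(state == "active" for state in subtask_states):
        return "active"
    # Assumption: otherwise some subtask stopped early and must be re-assigned.
    return "incomplete"
```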

Input  arguments  required  by  a  task  are  specified  by 
task  parameters.  A  task  parameter  can  be  an  integer,  real, 
Boolean,  or  string,  and  can  be  constant  or  variable.  A 
constant  parameter  is  assigned  a  single  specified  value  for 
all  subtasks.  A  variable  parameter  has  an  associated  range 
which  represents  the  parameter’s  value  space,  delimited 
by  start  and  end  points.  The  range’s  distribution  defines 
the  manner  in  which  the  range  should  be  divided  among 
subtasks.  An  incremental  distribution  signifies  that  a 
specified  step  function  should  be  applied  to  the  range  to 
calculate  subtask  parameter  values.  A  partitioned 
distribution  divides  the  range  into  a  series  of  subranges, 
with  each  subrange  being  assigned  to  a  subtask.  The 
upper  boundary  and  lower  boundary  parameter  types  are 
system-defined  dependent  parameters  associated  with  the 
subrange  boundaries  of  a  partitioned  parameter.  A 
dependent  parameter  can  also  be  defined  as  a 
mathematical  or  logical  function  of  other  parameters. 
Task attributes and parameter information are defined
through a browser user interface and stored in the
database by the DataManager.
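
To make the two range distributions concrete, the following Python sketch shows one possible reading of each. It assumes a constant step for the incremental distribution and equal-width subranges for the partitioned distribution; the function names are hypothetical, and the actual system weights subranges by subtask load, as described in Section 3.5.

```python
def incremental_values(start, end, step):
    """Incremental distribution: walk the range [start, end] with a fixed step."""
    if step <= 0:
        raise ValueError("step must be positive")
    values = []
    current = start
    while current <= end:
        values.append(current)
        current += step
    return values


def partition_range(start, end, n_subtasks):
    """Partitioned distribution: split [start, end] into equal subranges.

    Returns one (lower boundary, upper boundary) pair per subtask, which is
    the role played by the system-defined boundary parameters.
    """
    width = (end - start) / n_subtasks
    return [(start + i * width, start + (i + 1) * width) for i in range(n_subtasks)]
```

For example, partition_range(0, 100, 4) yields four subranges of width 25, each of which would be handed to one subtask as its lower and upper boundary parameters.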

3.4.  Node  registration 

A  processing  node  is  registered  and  scheduled  using  a 
browser  interface,  invoking  the  DataManager  module  to 
record  the  information  in  a  database.  Registration 
involves  providing  key  attributes  such  as  host  name  or  IP 
address,  CPU  type  and  speed,  number  of  CPUs,  operating 
system,  disk  and  main  memory  capacity,  as  well  as 
contact  information  (in  the  case  of  a  volunteer  processor). 
Once  registered,  a  processor  can  be  scheduled  by 
selecting  the  hourly  time  slots  during  which  it  will  be 
available  for  each  day  of  the  week.  The  processor’s 
participation  can  also  be  restricted  by  specifying  which 
tasks  it  is  available  to  perform. 

This  is  an  all-or-none  scheduling  model:  a  processor 
participates  in  a  computation  only  when  its  schedule 
allows.  The  all-or-none  model  is  well  suited  for  people 
who  wish  to  volunteer  a  known  and  fixed  amount  of  CPU 
time,  such  as  the  times  outside  regular  business  hours  for 
an  office  computer.  By  contrast,  nodes  that  participate  in 
the  SETI@home  project  run  a  computation  selectively 
through  a  specified  “nice”  value  that  assigns  a  local 
scheduling  priority  to  the  process[6],  and  restricting 
access  to  the  node  at  some  point  in  time  requires  explicit 
user  intervention. 
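
The hourly, per-weekday schedule lends itself to a simple lookup. The sketch below is a minimal illustration under our own assumptions: the schedule is held as a dictionary from weekday number (0 = Monday) to a set of available hours, and the names office_schedule, hours_available and is_accessible are hypothetical. It checks both whether a node is available now and whether it stays available long enough for a subtask.

```python
from datetime import datetime, timedelta

# Hypothetical schedule for an office computer: free outside 09:00-17:00 on
# weekdays (0 = Monday ... 4 = Friday) and all day at weekends.
off_hours = {hour for hour in range(24) if hour < 9 or hour >= 17}
office_schedule = {day: set(off_hours) for day in range(5)}
office_schedule[5] = set(range(24))
office_schedule[6] = set(range(24))


def hours_available(schedule, when):
    """Count consecutive available hourly slots starting at `when` (capped at one week)."""
    count = 0
    current = when.replace(minute=0, second=0, microsecond=0)
    while count < 7 * 24 and current.hour in schedule.get(current.weekday(), set()):
        count += 1
        current += timedelta(hours=1)
    return count


def is_accessible(schedule, when, needed_hours):
    """All-or-none check: the node is usable only if its schedule covers the need."""
    return hours_available(schedule, when) >= needed_hours
```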

3.5.  Task  execution 

A  distributed  computation  is  requested  from  the 
Dispatcher  through  a  browser  interface.  Once  invoked,  the 
Dispatcher  initiates  interactive  node  selection  by 
compiling  a  list  of  candidate  nodes  for  the  task.  A  node’s 
candidacy  for  a  given  task  is  determined  by  the  following 
conditions: 

•  Registration: the node is registered in the database

•  Participation:  the  node  is  registered  to  participate  in 
the  task 

•  Accessibility:  access  to  the  node  is  currently 
permitted  by  the  node’s  schedule 

•  Connectivity:  the  node  is  reachable  by  the  Dispatcher 

•  Configuration:  the  TaskApp  is  operational  on  the 
node 

Registration  and  participation  are  verified  simply  by 
querying  the  database:  a  node  must  be  registered  in  the 
database  if  it  is  to  be  used  at  all,  and  the  selected  task 
must  appear  in  the  node’s  participation  list  to  indicate  its 
“willingness”  to  perform  the  task.  Accessibility  is 
determined  by  looking  at  the  node’s  access  schedule  to 
see  if  the  node  is  currently  accepting  requests  and  if  it  will 
remain  usable  for  a  sufficient  amount  of  time. 


Connectivity  and  configuration  are  determined  in  a  single 
handshaking  step  whereby  the  Dispatcher  sends  a  “test” 
request  to  the  TaskApp  on  the  node.  If  no  response  is 
received  from  the  node,  or  if  an  error  is  encountered  in 
establishing  a  connection,  then  neither  condition  is  met.  If 
the  node’s  Web  server  responds  with  an  error  then 
connectivity  is  achieved  but  the  node  fails  to  meet  the 
configuration  condition.  If  a  correct  response  is  received 
from  the  TaskApp  then  it  is  concluded  that  the  node’s 
Web  server  was  able  to  launch  the  TaskApp  successfully 
and  therefore  the  node  fulfills  both  conditions. 
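
The three handshake outcomes described above map directly onto the connectivity and configuration conditions. The Python sketch below encodes that mapping; it is an illustration only, and the reply token "OK", the use of HTTP status 200 as the success indicator, and the function name are assumptions rather than details of the MoBiDiCK protocol.

```python
EXPECTED_REPLY = "OK"  # hypothetical token a TaskApp returns to a "test" request


def classify_handshake(connected, http_status=None, body=None):
    """Return (connectivity, configuration) for one "test" request.

    connected:   False if no response arrived or the connection failed.
    http_status: status code returned by the node's Web server, if any.
    body:        reply text produced by the TaskApp, if any.
    """
    if not connected:
        # No response or a connection error: neither condition is met.
        return (False, False)
    if http_status != 200 or body is None:
        # The Web server answered with an error: reachable but misconfigured.
        return (True, False)
    # A correct reply shows the Web server could launch the TaskApp.
    return (True, body.strip() == EXPECTED_REPLY)
```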

Other  node  selection  conditions  can  be  specified  by  a 
user  to  further  constrain  node  candidacy,  such  as  cut-off 
values  for  node  rating,  storage  space,  total  main  and 
temporary  memory,  and  number  of  CPUs.  Selection  can 
be  restricted  to  specific  categories  of  nodes,  such  as  local, 
remote,  dedicated,  or  shared.  Manual  selection  of  specific 
nodes  can  also  be  done  to  bypass  preset  conditions. 

After  node  selection  is  settled,  the  Dispatcher  is 
prompted  to  carry  out  the  task  mapping  phase  by 
partitioning  the  task  into  subtasks  and  mapping  the 
subtasks  to  selected  nodes.  More  than  one  subtask  may  be 
assigned  to  a  node,  as  may  be  desired  for  a 
multiprocessor;  a  node’s  subtask-to-CPU  relationship  is 
definable  in  its  registration  information  and  can  be 
modified  by  the  user  prior  to  dispatching. 

Task  mapping  incorporates  an  initial  load  balancing 
step  that  computes  the  load  of  each  subtask  based  on  the 
associated  node  performance  rating.  Subtask  load 
represents  a  fraction  of  the  task’s  total  workload  that 
should  be  allotted  to  the  node  to  which  the  subtask  is 
mapped.  A  node’s  rating  can  be  obtained  by  running  a 
benchmarking  TaskApp  module  that  measures  the  node’s 
ability  in  floating-point  and  integer  arithmetic,  memory 
access,  disk  access,  and  communication.  If  the  rating  is 
known,  subtask  load  can  be  calculated  as  follows. 

For a set of processing nodes $p_1, p_2, \ldots, p_n$ with
corresponding ratings $r_1, r_2, \ldots, r_n$, the weighted rating for
$p_i$ is

$$R_i = \frac{r_i}{\sum_{j=1}^{n} r_j} \qquad (1)$$

Subtask  load  is  obtained  by  dividing  a  node’s  weighted 
rating  by  the  number  of  subtasks  assigned  to  that  node. 
Hence the subtask load for node $p_i$ that is assigned $k$ subtasks
is

$$L_i = \frac{R_i}{k} \qquad (2)$$

In  the  parameter  allocation  step,  task  parameters  are 
inherited  by  each  subtask  and  values  are  assigned  to  these 
parameters.  The  value  assigned  to  a  subtask  parameter 
depends  primarily  on  the  parameter’s  type.  If  it  is  a 
constant  then  it  simply  takes  on  the  value  defined  for  the 
task.  If  it  is  a  variable  parameter  with  a  partitioned 
numerical range, then the value is found by applying the
subtask load to this range. For example, given a numerical
parameter, X, with a range delimited by lower boundary
$x_L$ and upper boundary $x_U$, the value of X for node $p_i$ is

$$X_i = L_i (x_U - x_L). \qquad (3)$$
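
A short worked example may help tie equations (1) to (3) together. The Python sketch below computes the weighted rating (a node's rating divided by the sum of all ratings), the subtask load (weighted rating divided by the number of subtasks on the node), and the resulting share of a parameter range. The function names and the sample ratings are hypothetical.

```python
def weighted_ratings(ratings):
    """Equation (1): each node's rating as a fraction of the total rating."""
    total = sum(ratings)
    return [rating / total for rating in ratings]


def subtask_loads(ratings, subtasks_per_node):
    """Equation (2): weighted rating divided by the node's number of subtasks."""
    return [
        weighted / k
        for weighted, k in zip(weighted_ratings(ratings), subtasks_per_node)
    ]


def subtask_values(loads, x_lower, x_upper):
    """Equation (3): portion of the range [x_lower, x_upper] given to each subtask."""
    return [load * (x_upper - x_lower) for load in loads]


# Two single-CPU nodes of rating 1 and one dual-CPU node of rating 2 that is
# assigned two subtasks (sample numbers, chosen for illustration).
ratings = [1, 1, 2]
subtasks_per_node = [1, 1, 2]
loads = subtask_loads(ratings, subtasks_per_node)
values = subtask_values(loads, 0, 1000)
```

With these numbers the weighted ratings are 0.25, 0.25 and 0.5, and every subtask, including each of the two on the dual-CPU node, receives a load of 0.25, i.e. 250 units of a range of 1000.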

The  mapping  process  can  be  repeated  if  the  user 
wishes  to  modify  node  candidacy  conditions  or  task 
attributes  and  parameters.  At  the  user’s  request,  the 
Dispatcher  begins  the  computation  by  sending  subtask 
requests  to  the  selected  nodes.  A  subtask  request  is 
successful  if  the  Dispatcher  receives  a  confirmation 
message  from  the  TaskApp  module.  If  all  requests  are 
successful,  the  task  instance  is  in  the  active  state. 


Figure 3. Task execution. The Dispatcher partitions and distributes
tasks  to  currently  scheduled  nodes,  while  the  Collector  gathers  subtask 
results.  During  the  execution  of  a  task,  the  Status  and  Statekeeper 
modules  ensure  fault-tolerance,  load-balancing  and  correct  scheduling. 

If the task is successfully launched, the Dispatcher
invokes  a  new  Status  process  and  a  new  Statekeeper 
process  to  monitor  the  TaskApp  processes.  During  its 
execution,  a  TaskApp  regularly  updates  a  local 
SubtaskStatus  file  (shown  in  Figure  1)  with  subtask 
progress  information.  Fault-tolerance  during  a 
computation  is  assured  by  the  Status  module  by 
periodically  downloading  each  node’s  SubtaskStatus  file. 
If  a  subtask  fails  to  complete  on  a  node,  the  Status  module 
re-assigns the subtask. The Statekeeper’s role is to
monitor node schedule overflows and maintain dynamic
load-balancing.  It  can  reschedule  subtasks  by  terminating 
existing  subtasks  and  starting  new  ones,  thus  maintaining 
the  computation  in  the  “all-or-none”  state  according  to 
each  node’s  schedule  in  the  database.  It  can  remap  the 
entire  task  to  a  new  set  of  nodes  if  too  many  schedule 
conflicts  are  encountered  during  a  computation  or  if  node 
availability  changes  dramatically. 

Local  output  files  produced  by  a  TaskApp  are  recorded 
in  an  OutputList  file.  When  a  TaskApp  completes  a 
subtask,  it  sends  a  completion  request  to  the  Collector. 
The  Collector  updates  the  subtask’s  status  in  the  database, 
obtains the OutputList from the node, gathers the output
files  from  the  node  and  stores  them  in  the  database  if 
required.  The  Collector  is  capable  of  storing  arbitrary  data 
objects  as  binary  files  in  the  database,  and  iterators  are 
provided  in  the  API  for  summarizing  or  combining  results 
once  the  dispatched  subtasks  are  all  completed.  After  all 
output  has  been  received,  the  Collector  sends  a  cleanup 
request  to  the  TaskApp,  asking  it  to  delete  the  output  files 
it  produced  on  the  node’s  filesystem,  as  reported  in  the 
TaskApp’s OutputList.

3.6.  Task  API 

Task  modules  are  programmed  using  the  Task  API, 
itself  integrated  with  NCBI  Toolkit  libraries.  The  Task 
API  facilitates  the  development  of  platform-independent, 
CGI-enabled  applications  which  can  be  operated  both  as 
stand-alone  executables  through  a  regular  command-line 
interface,  and  as  CGI  programs  that  can  be  invoked  by  a 
Web  server  daemon  when  it  receives  a  client’s  HTTP 
request  to  run  the  application.  In  other  words,  the  use  of 
the  Task  API  does  not  restrict  the  application  to  the 
MoBiDiCK  system.  Using  the  same  executable,  the  user 
can  choose  to  perform  the  computation  manually,  or  to 
install  the  program  behind  HTTP  servers  on  a  set  of  nodes 
so  that  the  task  can  be  distributed  using  the  MoBiDiCK 
kernel.  The  core  Task  API  functions  are  shown  in  Table 
1. 

This  flexibility  is  achieved  by  formally  defining  input 
parameters  in  the  TaskApp  program  code.  Each  task 
parameter  is  defined  in  a  TaskArg  structure.  The  members 
of  the  TaskArg  structure  contain,  among  other 
information,  a  parameter’s  command  line  tag,  CGI  field 
name,  type  (integer,  string,  boolean,  etc.),  range  of 
allowed  values,  and  default  value  if  any.  A  set  of  task 
parameters  is  defined  by  declaring  an  array  of  TaskArg 
structures.  Parameter  values  are  obtained  by  the 
application  with  the  GetTaskArgs  function,  which  detects 
the  input  method  (terminal  or  HTTP),  and  accordingly 
reads,  validates  and  copies  parameter  values  in  the 
TaskArg  array  for  subsequent  access. 
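
The idea of one parameter definition serving both the command line and CGI can be sketched as follows. This is not the Task API itself, which is built on the NCBI Toolkit; it is a Python illustration in which the dictionary keys, the sample parameters and the function names are assumptions. It relies on the standard CGI environment variables REQUEST_METHOD and QUERY_STRING to detect an HTTP invocation, and it omits the range validation that the text describes.

```python
from urllib.parse import parse_qs

# Hypothetical parameter table: command-line tag, CGI field name, type, default.
TASK_ARGS = [
    {"tag": "-n", "field": "iterations", "type": int, "default": 100},
    {"tag": "-s", "field": "seed", "type": int, "default": 1},
    {"tag": "-o", "field": "outfile", "type": str, "default": "out.txt"},
]


def get_input_method(environ):
    """A Web server sets REQUEST_METHOD for CGI programs; a terminal does not."""
    return "http" if "REQUEST_METHOD" in environ else "terminal"


def get_task_args(task_args, argv, environ):
    """Read parameter values from either input method into one dictionary."""
    if get_input_method(environ) == "http":
        fields = parse_qs(environ.get("QUERY_STRING", ""))
        raw = {arg["field"]: fields[arg["field"]][0]
               for arg in task_args if arg["field"] in fields}
    else:
        # Command line given as alternating tag/value pairs, e.g. -n 50 -o a.txt
        pairs = dict(zip(argv[0::2], argv[1::2]))
        raw = {arg["field"]: pairs[arg["tag"]]
               for arg in task_args if arg["tag"] in pairs}
    # Convert each supplied value to its declared type, else use the default.
    return {
        arg["field"]: arg["type"](raw[arg["field"]]) if arg["field"] in raw
        else arg["default"]
        for arg in task_args
    }
```

Called with argv = ["-n", "50", "-o", "a.txt"] and an empty environment, or with an environment containing QUERY_STRING "iterations=50&outfile=a.txt", the function returns the same parameter values, which is the property that lets one executable run both stand-alone and behind a Web server.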

Many  computations  consist  of  a  main  loop  that  iterates 
over  a  fixed  and  predetermined  range  of  values,  such  as  a 
sequence of numbers, a list of strings, or a collection of
records.  Tracking  the  progress  of  these  types  of  programs 
is  achieved  by  calling  the  TaskSetSize  function  prior  to 
entering  the  main  loop,  and  by  placing  a  call  to  the 
TaskIterate function in the loop body. TaskSetSize sets
the “problem size”, e.g. the total number of iterations to
be performed in the main loop. Each time TaskIterate is
called,  the  loop’s  progress  is  computed  by  dividing  the 
current  iteration  number  by  the  total  number  of  iterations. 
The  progress  is  thus  the  percentage  of  the  loop  that  has 
been  completed.  If  the  time  required  for  each  iteration  is 
more  or  less  uniform  (the  loop  body  is  deterministic),  then 
progress  is  proportional  to  the  computation’s  current 
running time by a more or less constant factor. When the
loop  body  is  non-deterministic,  as  is  the  case  for  random- 
walk  algorithms,  the  progress  still  provides  a  measure  of 
where  in  the  loop  the  process  is  currently  positioned. 
Subtask  progress  is  written  to  a  SubtaskStatus  file  in  the 
Web  server’s  published  directory  tree.  This  file  is 
periodically  obtained  by  the  Status  kernel  module  while  it 
monitors  the  computation. 

To communicate with a running TaskApp process,
messages or signals can be placed in its SubtaskStatus
file. For example, when the Statekeeper module must
cancel an active subtask, it invokes a “kill” TaskApp
process on the node, requesting it to interrupt the
execution of the subtask. The kill process writes a
“cancel” signal to the appropriate SubtaskStatus file. The
signal is detected and carried out in the TaskIterate
function at the next iteration of the subtask process, as it
reads the SubtaskStatus file before updating it. Thus
TaskIterate not only reports a subtask’s progress to the
kernel, but also serves to communicate with a TaskApp
process during execution.
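
The read-before-update behaviour described above can be sketched as follows. This is a minimal Python model, not the Task API itself: the JSON file layout, the class and method names, and the `request_cancel` helper standing in for the “kill” process are all assumptions, and no file locking is attempted.

```python
import json


class CancelRequested(Exception):
    """Raised when a "cancel" signal is found in the status file."""


class SubtaskProgress:
    """Hypothetical stand-in for the TaskSetSize / TaskIterate pair."""

    def __init__(self, status_path):
        self.status_path = status_path
        self.size = 0
        self.iteration = 0

    def task_set_size(self, size):
        """Record the problem size (total iterations) before the main loop."""
        if size <= 0:
            raise ValueError("problem size must be positive")
        self.size = size
        self.iteration = 0
        self._write({"progress": 0.0, "signal": None})

    def task_iterate(self):
        """Read the status file for signals, then update the progress."""
        status = self._read()
        if status.get("signal") == "cancel":
            raise CancelRequested("cancel signal found in status file")
        self.iteration += 1
        progress = 100.0 * self.iteration / self.size  # percent complete
        status["progress"] = progress
        self._write(status)
        return progress

    def _read(self):
        with open(self.status_path) as handle:
            return json.load(handle)

    def _write(self, status):
        with open(self.status_path, "w") as handle:
            json.dump(status, handle)


def request_cancel(status_path):
    """What a "kill" process might do: place a cancel signal in the file."""
    with open(status_path) as handle:
        status = json.load(handle)
    status["signal"] = "cancel"
    with open(status_path, "w") as handle:
        json.dump(status, handle)
```

Because the signal is only checked inside `task_iterate`, a cancel request takes effect at the next loop iteration, which matches the behaviour described in the text.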


Table 1. Core Task API functions

GetTaskArgs
GetTaskID
InitTask
GetInstanceNum
TaskSetSize
GetSubtaskID
TaskIterate
TaskLogWrite
TaskInterrupt
ErrLogPostEx
RecordOutput
TaskComplete
GetInputMethod

Output produced by a TaskApp is written in the form
of one or more output files on the compute node’s
filesystem. If output is to be collected by the kernel
server, the RecordOutput function is used to record the
names of files to be collected in an OutputList file written
in a Web-published directory on the node. At the end of
its computation, a TaskApp process informs the Collector
module that output is ready to be collected. After
obtaining the OutputList file from the compute node, the
Collector retrieves all the files listed in OutputList and
stores them in the central database or other specified
storage area.

Core library functions provided in the Task API are listed
in Table 1. Several other routines are also available as part
of the NCBI SDK, on top of which the Task API is built,
such as ASN.1 encoding, which we use in the
FOLDTRAJ TaskApp module. The general structure of a
TaskApp program is shown in Figure 4. A simple set of
rules, outlined below, must be followed in order to
produce TaskApp programs that can correctly interact
with the kernel.

a)  Define  all  input  parameters  in  the  TaskArg  array, 
allowing  the  TaskApp  to  receive  arguments  from  the 
Dispatcher. 

b)  Obtain  input  arguments  with  GetTaskArgs.  This 
enables  the  program  to  get  arguments  from  either  the 
command-line  console  or  from  the  Web  server.  If 
invoked  by  the  Dispatcher,  the  TaskApp  also  obtains 
required  MoBiDiCK  arguments  (e.g.  Task  ID, 
Dispatcher  and  Collector  IP  addresses). 

c) Use InitTask to initialize the computation. Node-specific
settings are read from a TaskApp configuration file;
log files are initialized.

d) Set the size of the computation with TaskSetSize
prior to the main loop.

e) Call TaskIterate in the main loop to record subtask
progress information and read signals from other
modules.


f) Use RecordOutput to report all output files produced
during the computation. This enables output
collection (by the Collector module) as well as output
cleanup.

g) Call TaskComplete before exiting. This invokes the
Collector to collect output, request output cleanup,
and record timing and status information.

h) If the TaskApp must exit prematurely during the
computation, use TaskInterrupt.

i) Record informational and error messages in
subtask-specific log files using TaskLogWrite and
ErrLogPostEx.


function and variable definitions
TaskArg array definition
main(...) {
    GetTaskArgs(...);
    InitTask();
    TaskSetSize(...);
    mainLoop {
        TaskIterate();
        RecordOutput(...);
    }
    RecordOutput(...);
    TaskComplete();
}

Figure 4. TaskApp structure and core Task API functions

4.  Results 

4.1.  RAMAPLOT 

Using the Task API we developed a TaskApp module
called RAMAPLOT. This program uses three-dimensional
protein structure information from NCBI’s
Molecular Modeling Database (MMDB) to generate a
Ramachandran plot for the structure, which is a graph of
the distribution of the protein’s α-carbon bond angles in
angular 2-D space (written as a GIF file). We performed
this task using 15 nodes on our Beowulf cluster, each
node configured with two Intel Pentium II 400 MHz
processors, 512 MB of RAM, the RedHat Linux operating
system, and the Apache HTTP server. The nodes were
interconnected by a 100Base-T network. The kernel
server was a Sun Sparc Ultra-1 running Solaris 2.6. The
MMDB database was copied to each node’s hard disk.

The  goal  of  the  task  was  to  generate  one 
Ramachandran  plot  for  each  of  851  protein  structures  in 
the MMDB. Two input parameters, dbsize and dbstart,
were defined for the RAMAPLOT task. The partitioned
parameter,  dbsize,  represents  the  number  of  records  to  be 
processed,  ranging  from  1  to  851  (corresponding  to  the 
first  and  last  record  indexes  of  the  database,  respectively). 
dbstart  is  defined  as  a  lower  boundary  for  the  subranges 
of  dbsize  serving  to  inform  the  program  of  the  starting 
database  record  number.  We  performed  15  instances  of 
the  RAMAPLOT  task,  starting  with  a  single  node  and 
adding  an  additional  node  for  every  new  instance.  Only 
one  CPU  per  node  was  used.  Execution  time,  speedup, 
and  efficiency  were  determined  for  each  instance  and  are 
summarized  in  Table  2  and  plotted  in  Figures  5,  6  and  7. 


Table 2. RAMAPLOT timing results using MoBiDiCK

Instance   Subtasks   Time (s)   Speedup   Efficiency (%)
    1          1         714       0.972        97.2
    2          2         378       1.84         92.0
    3          3         265       2.62         87.2
    4          4         195       3.57         89.1
    5          5         165       4.20         84.1
    6          6         144       4.82         80.3
    7          7         116       6.01         85.8
    8          8         106       6.55         81.9
    9          9         107       6.47         71.9
   10         10          96       7.19         71.9
   11         11          79       8.79         79.9
   12         12          87       8.00         66.7
   13         13          76       9.09         69.9
   14         14          66      10.49         74.9
   15         15          69      10.07         67.1

Speedup is defined as the ratio Ts/Tp, where Tp is the
parallel execution time and Ts is the best serial execution
time. Ts, obtained by running RAMAPLOT on one cluster
node through a command-line interface, was found to be
695 seconds. Efficiency is defined as the ratio So/Smax of
observed speedup over ideal speedup, with Smax always
equal to the number of nodes.
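
The two definitions can be checked against the reported timings with a few lines of Python. This is only a worked example: the function names are ours, and the serial time of 695 seconds and the parallel times are the values reported in the text and in Table 2. The computed figures agree with the table only approximately, which suggests rounding in the reported values.

```python
SERIAL_TIME_S = 695.0  # best serial execution time reported in the text


def speedup(parallel_time_s, serial_time_s=SERIAL_TIME_S):
    """Observed speedup: serial time divided by parallel time."""
    return serial_time_s / parallel_time_s


def efficiency(parallel_time_s, nodes, serial_time_s=SERIAL_TIME_S):
    """Efficiency in percent: observed speedup over ideal speedup (= nodes)."""
    return 100.0 * speedup(parallel_time_s, serial_time_s) / nodes


if __name__ == "__main__":
    # (nodes, parallel time in seconds) for three instances from Table 2
    for nodes, time_s in [(2, 378), (8, 106), (15, 69)]:
        print(f"{nodes:2d} nodes: speedup {speedup(time_s):.2f}, "
              f"efficiency {efficiency(time_s, nodes):.1f}%")
```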


Figure 5. Execution time for RAMAPLOT. The task was
repeated with a varying number of nodes.


A steady speedup was obtained as the number of nodes
was incremented from 1 to 15. The use of all 15 nodes
yielded a speedup of 10.1, corresponding to 67.5%
efficiency.

Figure 6. Speedup for RAMAPLOT. Ideal speedup is
represented by the broken line.

An increase in the number of subtasks raises the
probability that any two subtasks complete at the same
time. This leads to increased network and I/O contention on
the kernel server as subtask completion requests invoke
Collector processes. These effects are reflected in Figure
7 by a general decrease in efficiency as the number of
nodes rises. An average efficiency of 80.0% was achieved
over all 15 instances.


Figure 7. Efficiency for RAMAPLOT

4.2. FOLDTRAJ

Much of our work is dedicated to the protein folding
problem. This problem in the field of structural biology
represents our inability to computationally predict the
three-dimensional conformation of arbitrary proteins
given only primary amino-acid (AA) sequence
information. Given an input file known as a “trajectory
distribution” containing angular 2-D space amino-acid
frequency information about a particular protein,
FOLDTRAJ can generate a number of random yet
chemically valid protein conformers, placing each in a
separate output file in binary ASN.1 or ASCII PDB
format. The correctness of a predicted structure, important
in developing methods to score the fitness of generated
proteins, is measured by calculating its Root Mean
Squared Deviation (RMSD) relative to the protein’s
native fold.

FOLDTRAJ was developed independently and
intended to be used both as a stand-alone application and
in a distributed computing framework. We used the Task
API to migrate FOLDTRAJ to operate as a TaskApp
under MoBiDiCK. Integration with the Task API also
enabled the same FOLDTRAJ executable to be operated
through an HTML-based interface in both stand-alone and
client-server modes.

Figure 8. Task definition and mapping of FOLDTRAJ.
The definition includes general task information and parameter
information (top left frame). Four dual-processor nodes are selected
(top-right frame). The task has been partitioned into 8 subtasks, each
mapped to a node’s processor. Subtask parameters are inherited from
the task and assigned values (middle frame). The task can be
re-partitioned after editing the task or modifying node selection (through
menu in bottom frame).


An example task definition and mapping for
FOLDTRAJ is shown in Figure 8. The task has three
significant parameters: filein, numstruc and /start. The
first, filein, is a constant string parameter that holds the
name of the input file; in the figure, the input filename
“1vii” corresponds to the 36 amino-acid Villin headpiece
protein, a small protein we use for testing. The
partitioned integer parameter, numstruc, is the total
number of structures to be generated; its range is set from
1 to 500,000, signifying that half a million structures are
to be generated. The integer parameter /start denotes the
starting structure number and is set as a lower boundary
of numstruc, with the same range. The example instance
was mapped to four of our dual-processor cluster nodes,
using both CPUs per node, resulting in 8 subtasks as
shown in Figure 8. Subtask loads are equal (0.125) since
the nodes were rated equally. Since filein is constant, each
subtask is assigned the same value. The range of numstruc
is divided equally among the 8 subtasks. A subtask’s
/start value informs the TaskApp where in the range it
should start numbering its structures and is used to
produce unique output filenames.
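
The partitioning of a ranged parameter according to subtask loads can be sketched as below. This is an illustrative Python model rather than the Dispatcher's actual algorithm: the function name, the representation of each subrange as a (start, count) pair, and the rounding policy are assumptions.

```python
def partition_range(first, last, loads):
    """Split the inclusive range [first, last] into consecutive subranges.

    `loads` are the subtask load fractions and must sum to 1. Each result is
    a (start, count) pair: the lower boundary of the subrange and the number
    of items assigned to that subtask.
    """
    if not loads:
        raise ValueError("at least one subtask load is required")
    if abs(sum(loads) - 1.0) > 1e-9:
        raise ValueError("subtask loads must sum to 1")
    total = last - first + 1
    subranges = []
    start = first
    assigned = 0        # items handed out so far
    cumulative = 0.0    # cumulative load fraction
    for index, load in enumerate(loads):
        cumulative += load
        if index == len(loads) - 1:
            end_count = total  # the last subtask absorbs any rounding drift
        else:
            end_count = round(total * cumulative)
        count = end_count - assigned
        subranges.append((start, count))
        start += count
        assigned = end_count
    return subranges


if __name__ == "__main__":
    # Eight equally rated processors, 500,000 structures in total.
    for start, count in partition_range(1, 500_000, [0.125] * 8):
        print(start, count)
```

With eight equal loads each subtask receives 62,500 structures, and the start values give every subtask a disjoint numbering interval, which is what makes unique output filenames possible.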

Using  MoBiDiCK  we  regularly  perform  distributed 
FOLDTRAJ  computations  to  carry  out  prediction 
experiments  on  various  proteins.  Figure  9  plots  the  results 
of  four  such  experiments.  In  each,  50,000  protein 
structures  were  generated  using  15  of  our  dual-processor 
cluster nodes. The frequency distribution of the resulting
Root Mean Squared Deviation values indicates the
accuracy of protein backbone atom prediction of random
protein conformers made with FOLDTRAJ. From this
ongoing  testing  we  are  obtaining  a  better  understanding  of 
the  relationship  between  sample  size,  protein  size  and 
how  well  entities  in  the  sampled  protein  ensembles  fit  the 
true  structure  of  a  protein. 


Figure 9. RMSD frequency distribution of random protein
structures generated with FOLDTRAJ. Each curve represents a
separate instance of FOLDTRAJ generating 50,000 structures using 30
subtasks on 15 nodes; protein name, size in number of amino-acids
(AA), and execution time are indicated for each experiment.

5.  Related  work 

Existing  cluster  computing  tools  that  are  publicly 
available  include  PVM  (Parallel  Virtual  Machine)  and 
MPI  (Message  Passing  Interface).  Commonly  used  on 
LANs  and  clusters  of  workstations,  these  systems  provide 
resource  encapsulation  and  monitoring  functions,  and 
transparent  heterogeneity  utilities  through  a  messaging 
API. PVM provides a high-level system for a user to
coordinate tasks spread across a network of heterogeneous
workstations. The set of nodes is perceived as a single
virtual machine through a message-passing abstraction
and a library of functions for task creation and
management [13][14].

Numerous other systems exist besides the more widely
used PVM and MPI, and many of their aspects can be
directly compared with MoBiDiCK. Their applications to
cluster and distributed computing have provided us with
useful case studies; they include Globus, SuperWeb, Condor,
Linda, Piranha, NOW, Legion, WebOS, Atlas, ParaWeb,
Bayanihan, Popcorn, Charlotte, JPVM, RMI, CORBA,
Javelin, Nimrod, Clustor, JICE and LSF. At the time we
began building MoBiDiCK (Oct. 1997) it was not clear to
us that other systems were capable of doing the multiple
duty that FOLDTRAJ required for distributed computing
over the Internet with clients at this level of
sophistication. We therefore set out to develop
MoBiDiCK with the goals that it provide an integrated
environment for process control, as shown in Figure 8, and
be capable of large-scale, high-performance,
database-driven, heterogeneous distributed computing.

6.  Future  Directions 

6.1.  Estimating  subtask  execution  time 

In  general,  the  execution  time  of  a  subtask  on  a  node  is 
influenced  by  (1)  subtask  load  and  (2)  node  rating. 
Subtask  load  is  a  computed  fraction  of  the  total  work  to  be 
done  in  the  task.  Node  rating  is  a  measure  of  a  node’s 
performance  as  determined  by  a  benchmarking  procedure 
that  incorporates  limiting  factors  associated  with  network 
bandwidth  and  congestion,  CPU  performance,  memory 
and  storage  availability,  and  I/O  efficiency.  A  node’s 
rating  is  directly  proportional  to  this  measured 
performance. 

The execution time T of a subtask S with load L
assigned to a node with rating R can be estimated by
T = k F(L)/R, where F is the task’s time complexity as a
function of subtask load. If known, F can be supplied as an
attribute of the task and saved in the database. The
constant k is given by F(L_j)/R_j, where L_j is the average
subtask load over previous instances (with the same
parameter definitions) and R_j is the average rating of all
nodes that computed these subtasks; k thus captures a
task’s performance history over all previous instances and
nodes. Since timing and node rating information are
stored in the database for every task instance, k can be
quickly updated after each task execution, keeping its
value readily available to the Dispatcher to calculate T
during the node selection phase of future instances.
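
A small Python sketch may help make the estimate concrete. It assumes the estimate has the form T = k·F(L)/R with a task-specific constant k. The calibration step shown here, which averages observed time × rating / F(load) over past subtasks, is our own assumption about how such a constant could be fitted from stored timings; it is not the formula given in the text, and all names are hypothetical.

```python
def calibrate_k(history, complexity):
    """Fit the constant k from past subtasks (an assumed calibration).

    `history` is a list of (time_s, load, rating) tuples taken from earlier
    instances; `complexity` is the function F of the subtask load.
    """
    if not history:
        raise ValueError("no previous instances to calibrate from")
    ratios = [time_s * rating / complexity(load)
              for time_s, load, rating in history]
    return sum(ratios) / len(ratios)


def estimate_time(k, load, rating, complexity):
    """Estimated execution time T = k * F(L) / R."""
    return k * complexity(load) / rating


if __name__ == "__main__":
    def linear(load):
        # A task whose running time grows linearly with its load.
        return load

    past = [(100.0, 0.5, 1.0), (50.0, 0.25, 1.0)]
    k = calibrate_k(past, linear)
    print(estimate_time(k, load=0.125, rating=2.0, complexity=linear))
```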

6.2.  Selective  fault  tolerance 

The problem of managing distributed computations on
a collection of heterogeneous nodes is challenging: both
the availability and performance of nodes are
unpredictable, and effective mechanisms must exist to
detect and handle failures. Fault-tolerance duties in
MoBiDiCK are centralized at the kernel level, instead of
being distributed as in PVM. This is because in a
distributed computing model the authority granted to
compute on a node is much more limited. Malfunctions
that occur during a computation are abstracted from the
TaskApp, which in many cases is favorable to the
application developer since it removes the burden of
appropriately responding to failures. Two conditions
should remain satisfied throughout the execution of a
task: the task will complete and performance is
maximized.

Task  completion  is  ensured  by  detecting  and 
remedying  process-level  and  node-level  failures.  Process 
failure  occurs  when  a  running  TaskApp  program  is 
prematurely  terminated  either  by  the  process  itself  due  to 
execution  error,  by  the  operating  system,  by  the  HTTP 
server,  or  by  direct  user  intervention  at  the  node  console. 
Causes  of  node  failure  include  operating  system 
instability,  faulty  hardware,  HTTP  server  malfunction  or 
misconfiguration,  and  communication  link  breakdown. 

To  detect  and  respond  to  such  occurrences,  the 
Dispatcher  invokes  the  Status  module  as  soon  as  a  task  is 
dispatched.  Status  carries  out  monitoring,  fault  detection 
and  fault  recovery  functions  for  an  active  task  instance. 
From each node performing a task, Status periodically
downloads the SubtaskStatus file produced by the
TaskApp process. This file contains the current
percentage-done progress of the assigned subtask; the subtask’s
database record is updated with each new progress value.
Process  failure  is  suspected  if  no  change  in  subtask 
progress  is  observed  over  a  sufficient  length  of  time.  If  the 
SubtaskStatus  file  cannot  be  obtained  from  the  node’s 
HTTP  server  after  several  attempts,  node  failure  is 
suspected.  Node  failure  implies  the  failure  of  all  active 
subtasks  assigned  to  the  node. 
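
The two detection rules above can be expressed as a small state machine. The Python sketch below is illustrative only: the class, the threshold names, and the returned labels are assumptions, and fetching the SubtaskStatus file over HTTP is represented simply by passing `None` when a download fails.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SubtaskMonitor:
    """Tracks one subtask as seen by a hypothetical Status-like poller."""
    stall_timeout_s: float      # how long progress may stay unchanged
    max_fetch_failures: int     # consecutive failed downloads tolerated
    last_progress: float = -1.0
    last_change_time: Optional[float] = None
    fetch_failures: int = 0

    def observe(self, now: float, progress: Optional[float]) -> str:
        """Record one polling result and classify the subtask.

        `progress` is the percentage read from the status file, or None if
        the file could not be obtained from the node's HTTP server.
        """
        if progress is None:
            self.fetch_failures += 1
            if self.fetch_failures >= self.max_fetch_failures:
                return "node_failure_suspected"
            return "ok"
        self.fetch_failures = 0
        if progress >= 100.0:
            return "ok"  # a finished subtask is not a stalled one
        if self.last_change_time is None or progress != self.last_progress:
            self.last_progress = progress
            self.last_change_time = now
            return "ok"
        if now - self.last_change_time >= self.stall_timeout_s:
            return "process_failure_suspected"
        return "ok"
```

A node-level verdict would then be applied to every active subtask on that node, as the text notes.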

The  Status  module  can  respond  to  subtask  failure  in  a 
variety  of  ways.  The  exact  measures  to  be  taken  can  be 
user-specified  before  and  during  the  computation. 
Possible  fault  recovery  behaviors  are: 

(a)  Carry  on  with  the  computation 

The  remaining  subtasks  are  left  running  and  the 
failure  is  disregarded. 

(b)  Cancel  execution 

Status  sends  a  “cancel”  signal  to  the  participating 
nodes  in  order  to  terminate  the  remaining  subtasks  for 
the  instance.  The  computation  is  terminated  and  no 
further  action  is  taken. 


(c)  Restart  the  execution 

The  instance  is  cancelled  as  in  (b)  and  the  Dispatcher 
is  invoked  to  launch  a  new  instance,  replacing  the 
faulty  nodes  with  new  ones  if  possible. 

(d)  Re-allocate  the  failed  subtask 

This  can  be  done  in  at  least  two  ways: 

•  reassign  or  migrate  the  subtask  to  a  new  node  if 
one  is  available,  or 

•  redistribute  the  subtask’s  load  to  other  active 
nodes. 

6.3.  Task  migration 

The all-or-none model that MoBiDiCK uses offers
some unique cases to consider for task migration. The
access period represents how long a node is available at a
given point in time; it is computed from the node
schedule, a block of 24x7 cells representing each hour in
the week. For example, if a node’s access schedule
permits use from 6 p.m. to 11 p.m. on a given day then, at
8:30 p.m. the same day, the access period is 2.5 hours. A
schedule overflow occurs when the time to completion of
a TaskApp process running on a node exceeds the node’s
access period. During a computation, a Statekeeper kernel
process periodically scans the database to detect possible
schedule overflows, by checking a subtask’s progress
(updated by the Status module) against the access period
of the node performing the subtask. If an overflow is
anticipated, the Statekeeper migrates the subtask to
another node by invoking a new Dispatcher process to
re-send the affected subtask.
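
The access period and the overflow check can be illustrated with a short Python sketch. The schedule is modelled as 7 lists of 24 booleans, one cell per hour of the week as in the text; the function names and the linear extrapolation used to estimate the time to completion are assumptions, not the Statekeeper's documented method.

```python
def access_period_hours(schedule, day, hour_of_day):
    """Hours of uninterrupted access remaining from the given moment.

    `schedule[day][hour]` is True when the node may be used during that
    hour; `hour_of_day` may be fractional (20.5 means 8:30 p.m.).
    """
    hour = int(hour_of_day)
    if not schedule[day][hour]:
        return 0.0
    remaining = (hour + 1) - hour_of_day  # rest of the current hour cell
    d, h = day, hour
    for _ in range(7 * 24 - 1):           # at most one full week ahead
        h += 1
        if h == 24:
            h, d = 0, (d + 1) % 7
        if not schedule[d][h]:
            break
        remaining += 1.0
    return remaining


def schedule_overflow(progress_percent, elapsed_hours, access_hours):
    """True if the subtask is expected to outlast the access period.

    Time to completion is extrapolated linearly from the progress so far;
    with no progress yet, no estimate is made and False is returned.
    """
    if progress_percent <= 0:
        return False
    time_to_completion = elapsed_hours * (100.0 - progress_percent) / progress_percent
    return time_to_completion > access_hours


if __name__ == "__main__":
    week = [[False] * 24 for _ in range(7)]
    for h in range(18, 23):               # 6 p.m. to 11 p.m. on day 2
        week[2][h] = True
    print(access_period_hours(week, 2, 20.5))
```

The example reproduces the case in the text: a 6 p.m. to 11 p.m. window queried at 8:30 p.m. leaves an access period of 2.5 hours.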

Prior  to  task  dispatching,  an  initial  load  balancing  step 
determines  the  load  of  each  subtask  based  on  the  rating  of 
the  associated  node.  Yet  a  node’s  performance  may  vary 
during  the  computation  due  to,  for  example,  increased 
CPU  or  I/O  contention,  causing  a  TaskApp  process  to 
slow  down.  If  left  unchecked,  this  can  lead  to  significant 
idle  time  and  hence  reduced  speedup.  To  avoid  this,  we 
intend  to  equip  the  Statekeeper  module  with  dynamic  load 
balancing  and  migration  functions  through  which  the  load 
of  a  slow  subtask  can  be  wholly  or  partially  transferred  to 
faster  nodes. 

6.4.  Other  directions 

In addition to the above work showing the use of
MoBiDiCK in a controlled cluster environment, further
components are being implemented to allow wide-area
distributed computing.

6.4.1. Accessing clusters with hierarchical kernels.
Many Beowulf clusters are set up using IP addresses in
the 192.168.x.x or other non-public ranges. This
precludes them from being seen over the Internet and used
in the MoBiDiCK system described so far. However, most
are configured with a “gateway” node, which is usually
the cluster “head”, has a public IP address, and is set
up to use IP masquerading so that nodes can access the
Internet. We have devised a method to allow MoBiDiCK
to operate on these clusters, forming “clusters of clusters”.
This is a configuration that consists of a root kernel and
several child kernels that manage computations on local
site nodes only. Since the Dispatcher and other kernel
modules are already CGI modules, they can themselves be
made into TaskApp processes, and arranged in a
hierarchy. A child kernel receives a single subtask from a
parent kernel. The parent kernel sees the child kernel as
a single multiprocessor system with a subtask load based
on the cumulative rating of the child kernel’s local nodes.
The child Dispatcher interprets the subtask as a local task,
and thus applies the same partitioning and mapping
mechanisms as the parent Dispatcher. Output is first
collected locally; the child Collector then passes local
output to its parent Collector. Child Status and
Statekeeper modules maintain fault-tolerance and task
migration locally, while sending periodic summary
progress reports to their parent counterparts. This
configuration may enable nodes hidden behind firewalls
as well as internal cluster nodes to participate in
distributed computations through the child kernel. This
approach may also be taken on large multiprocessor SMP
machines; it is not limited to cluster use. It may also be
used to enhance scalability by distributing the task
management load to multiple servers and making it easier
to manage a large number of nodes.

6.4.2.  Kernel  redundancy.  To  avoid  single  points  of 
failure,  kernel  modules  can  be  mirrored  across  several 
failover  servers.  Nodes  are  made  aware  of  alternate  kernel 
locations  so  that  if  one  server  fails,  another  can  assume 
task  management  and  output  collection  functions. 

6.4.3.  FastCGI.  Performance  of  kernel  modules  may  be 
significantly  improved  by  migrating  them  to  the  FastCGI 
extension.  This  will  be  of  particular  benefit  to  the 
Collector  module.  Under  standard  CGI,  a  new  Collector 
process  is  created  for  each  subtask  completion  request, 
hence  repeatedly  incurring  process  creation  and 
initialization  overhead.  Under  FastCGI,  one  or  just  a  few 
persistent  Collector  processes  would  handle  all  requests. 

6.4.4.  Volunteer  computing.  We  hope  to  involve  the 
general  public  in  our  distributed  protein  folding 
experiments,  by  asking  them  to  register  their  Web  server 
nodes  and  volunteer  idle  CPU  cycles.  The  node  access 
schedule  gives  full  control  to  node  administrators  and 
owners  as  to  when  and  how  long  their  nodes  can  be  used. 
Additional security features, such as Web server level
authentication, will be required in the kernel modules to
ensure safe access of volunteer nodes.

7.  Summary 

We presented MoBiDiCK as a tool for distributed
computing based on well established protocols, HTTP and
CGI. The relatively high communication latencies of
these protocols over the Internet make the system
most suitable for data-parallel tasks that are CPU-intensive
and that require minimal inter-node
communication. Node accessibility is controlled by a
real-time access schedule. A central relational database is used
to hold static information about nodes and tasks, as well
as dynamic data relating to node availability and task
progress. The database also stores useful task timing
information that can be used to build performance reports
and histories of past computations, which in turn can
serve to predict and optimize the performance of future
instances. We reported the development of two TaskApps,
RAMAPLOT and FOLDTRAJ. The former yielded
encouraging speedup and efficiency results using a local
cluster of Web server nodes, despite I/O contention on the
kernel node during collection of output. This problem is,
however, not unique to our system. FOLDTRAJ is an
application we regularly employ in the MoBiDiCK
environment in order to perform our computational
protein folding experiments. We are continuing our
development of MoBiDiCK and look forward to pursuing
the future directions outlined above.

Abstract

RsdEditor is a graphical user interface which produces
specifications of computational resources. It is used in the
RSD (Resource and Service Description) environment for
specifying, registering, requesting and accessing resources
and services in a metacomputer.

RsdEditor was designed to be used by the administrators
and users of metacomputing environments. At the
administrator level, the GUI is used to describe the available
computing and networking components of a metacomputer.
At the user level, RsdEditor can be used to specify which
characteristics of the computational resources are needed
to execute a meta-application.

This paper is organized as follows: Section 1 introduces
the RsdEditor. Section 2 briefly describes the RSD
environment, and Section 3 highlights various features and
implementation issues of the RsdEditor.


Keywords: Metacomputing, Resource Management,
Resource and Service Description.

1.  Introduction 

RsdEditor is a graphical user interface for specifying
metacomputer resources. It was developed by CNUCE-CNR
in cooperation with PC² Paderborn as part of the
Metacomputer On-Line (MOL) project [1]. MOL exploits
the Computing Center Software (CCS) [2, 3] in order to
manage the resources of a computing center. Within CCS,
the resource and service description language RSD [4] is
used to describe the metacomputer resources managed by
CCS.

RsdEditor provides user-friendly visual support for
automatically generating ASCII files. These are structured
according to the RSD language and describe the resources
specified through the graphical features of the interface.
RsdEditor was designed to be used by administrators and
users of a metacomputer.

As shown in Figure 1, system administrators use the
RsdEditor to specify particular characteristics of the
metacomputer’s resources (computational nodes, networks,
software services, etc.). Specifically, the administrator can
assign attributes such as type and number of processors,
memory size, software environments, architectural classes,
latency, or bandwidth to the available computational
resources. Likewise, users can specify which characteristics
of the computational resources are needed to execute
their meta-application. This phase is not intended to select
specific resources but to indicate the general attributes
belonging to a class of resources. The specifications made
by the administrator and the user can be used to generate
two graphs representing the metacomputer’s configuration
and the user’s requirements, respectively. The allocation of
the resources needed to execute the meta-application on the
metacomputer is a question of mapping the user graph onto
the metacomputer graph.

Figure 1. The use of the specifications generated
by RsdEditor

Several mapping algorithms [5, 6, 7, 8] have been put
forward to solve this problem.

The RSD resource specification file, automatically
generated by RsdEditor, is analyzed by a parser to obtain
Abstract Data Type (ADT) objects [4]. ADT objects can only
be accessed through the RsdAPI, which provides an abstract
interface to the RSD data structures. As shown in Figure 1,
the mapping of the task graph onto the available resources
as generated by the mapping algorithm is used by CCS to
start and control the execution of a meta-application.

2.  RSD  Environment 

RsdEditor is part of the RSD environment [4] that
provides services and tools for specifying, registering,
requesting and accessing computer resources in heterogeneous
computing environments. RSD comprises three major
components:

•  a compiler system that transforms resource descriptions
into ADTs (described in [4]),

•  an ADT object library with API (outlined in [4]),

•  a graphical user interface and editor, RsdEditor
(described in this paper).

In RSD, resources and services are represented by
hierarchical graphs with attributed nodes and edges describing
static and dynamic properties such as communication
bandwidth, message latency, or CPU load. Tools exist for
end-users as well as for system administrators (Figure 2).


Figure 2. RSD environment

The  output  of  the  graphical  RsdEditor  is  sent  through 
the  RsdParser  which  generates  abstract  data  objects  that  can 
be  stored  or  submitted  for  further  processing  by  (remote) 
resource management systems. In addition, ASCII RSD files
can also be translated by the parser into abstract objects.
Because the objects are bundled with their corresponding
methods, the data can be interpreted and manipulated on other
machines. Internal RSD objects can only be accessed through
the RsdAPI, which provides the data structures with an
abstract interface. For later modification, the data
structures are re-translated into their original form with
the graphical and textual components. This is possible
because the internal data representation also contains a
description of the component's graphical layout.

2.1  Textual  versus  Graphical  Interface 

RsdEditor has been designed to provide a user-friendly
alternative to the textual RSD representation [4]. From a
theoretical point of view, both representations are
equivalent. The RsdEditor output is parsed and compiled by
the same RSD compiler system used for the language
representation; hence, RsdEditor is no more expressive than
the language. Conversely, the language is no more expressive
than RsdEditor, as can be seen by looking at its more
advanced features, namely dynamic attributes and macros.

Dynamic attributes [4] provide a means of handling dynamic
data that are obtained at runtime. For example, when running
a WAN distributed application, the optimal (re-)mapping of
the processes may depend on the current network performance.
For this purpose, dynamic attributes provide up-to-date
information on the current network status. When a dynamic
attribute (keyword DYNAMIC) is parsed, the compiler system
generates a corresponding object with appropriate access
methods. These are then used by the dynamic data manager at
runtime to provide up-to-date data in a synchronous or
asynchronous way. Dynamic attributes can be specified in the
same way in the RsdEditor and in the textual representation.
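
To make the idea of an attribute object with access methods more concrete, here is a minimal Python sketch. It is not the object generated by the RSD compiler system: the class name DynamicAttribute, the probe callable, and the max_age parameter are assumptions introduced for illustration, and only the synchronous access path is shown.

```python
import time


class DynamicAttribute:
    """Attribute whose value is obtained at access time from a probe function."""

    def __init__(self, probe, max_age=0.0):
        self._probe = probe      # hypothetical callable returning the current value
        self._max_age = max_age  # seconds for which a cached value is reused
        self._cached = None
        self._stamp = None

    def get(self):
        """Synchronous access: re-probe when the cached value is too old."""
        now = time.monotonic()
        if self._stamp is None or now - self._stamp >= self._max_age:
            self._cached = self._probe()
            self._stamp = now
        return self._cached


# Simulated network measurements stand in for a real monitoring source.
samples = iter([80.0, 35.0])
bandwidth = DynamicAttribute(lambda: next(samples))  # max_age=0: always re-probe
first = bandwidth.get()
second = bandwidth.get()
```

An asynchronous variant would have the data manager refresh the cached value in the background, so that `get` never blocks on a measurement.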

Macros are the one language feature not included in
RsdEditor. In the textual representation, they provide a
shortcut for tedious repetitive declarations. In RsdEditor,
the same effect is achieved by copying the corresponding
edges or (hyper-)nodes.

2.2  RSD  Tools  in  CCS 

Maximizing  the  system  utilization,  and  maintaining  a 
high  degree  of  system  independence  for  improved  portabil¬ 
ity  and  easier  adaptation  to  new  systems  have  been  the  two 
main  goals  of  the  CCS  [3]  project.  CCS  tackles  these  two 
conflicting  goals  by  splitting  the  scheduling  process  into 
two  parts.  Figure  3  depicts  the  RSD  flow  in  CCS. 

The  hardware  independent  part  is  located  in  the  Queue 
Manager  (QM).  It  has  no  information  on  the  mapping  con¬ 
straints  such  as  the  minimum  cluster  size,  or  the  location  of 
I/O-nodes.  The  hardware  dependent  task  is  performed  by 
the  Machine  Manager  (MM).  It  verifies  whether  a  schedule 
computed  by  the  QM  can  be  mapped  onto  the  hardware  at 
the  specified  time. 

The  RSD  tools  are  used  in  the  CCS  management  sys¬ 
tem  for  describing  system  resources  and  user  requests.  At 
boot  time,  all  CCS  components  read  the  RSD  specification 
created  by  the  administrator  and  extract  the  relevant  infor¬ 
mation  (by  use  of  the  RsdAPI).  For  example,  the  MM  reads 
the  machine  topology  and  attributes,  whereas  the  QM  only 
extracts  information  such  as  the  number  of  PEs  or  the  avail¬ 
able  operating  system(s). 

The  UI  (User  Interface)  generates  an  RSD  description 
from  the  user’s  parameters  (or  from  a  given  RSD  descrip¬ 
tion)  and  sends  the  description  to  the  Access  Manager  (AM). 

The  AM,  which  is  responsible  for  authentication,  au¬ 
thorization,  and  accounting,  checks  whether  the  request 
matches  the  administrator’s  given  limits  and  forwards  it  to 
the  QM. 

The QM extracts the information, computes a schedule and
sends it to the MM. The MM verifies this schedule by mapping
the user-given RSD description against the static (e.g.
topology) and dynamic (e.g. PE availability) information on
the system resources. If no mapping is possible, the MM
returns an alternative schedule. The QM either accepts this
schedule or uses it to compute a new one.

Figure 3. RSD flow in the CCS system.
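
The exchange between QM and MM described above can be sketched as a small propose/verify loop. The Python below is a simplified illustration under stated assumptions, not CCS code: a schedule is reduced to a start slot and a PE count, the MM's mapping check is reduced to comparing against free PEs per time slot, and the names Schedule, MachineManager, QueueManager, and max_delay are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Schedule:
    start: int  # index of the time slot in which the job starts
    pes: int    # number of processing elements requested


class MachineManager:
    """Hardware-dependent part: knows how many PEs are free in each slot."""

    def __init__(self, free_pes_per_slot):
        self.free = list(free_pes_per_slot)

    def verify(self, s: Schedule) -> Optional[Schedule]:
        """Return s if it can be mapped, otherwise the earliest later alternative."""
        for slot in range(s.start, len(self.free)):
            if self.free[slot] >= s.pes:
                return Schedule(slot, s.pes)
        return None  # no mapping possible within the known horizon


class QueueManager:
    """Hardware-independent part: proposes schedules and judges alternatives."""

    def __init__(self, mm, max_delay):
        self.mm = mm
        self.max_delay = max_delay

    def submit(self, pes, start=0):
        proposal = Schedule(start, pes)
        answer = self.mm.verify(proposal)
        if answer is None:
            return None
        if answer.start - proposal.start <= self.max_delay:
            return answer  # accept the original or the alternative schedule
        return None        # a fuller QM would compute a new schedule here


mm = MachineManager([4, 4, 16])
qm = QueueManager(mm, max_delay=5)
accepted = qm.submit(8)
```

Here the request for 8 PEs cannot start in slot 0, so the MM answers with slot 2 and the QM accepts it because the delay is within its limit.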

Although  all  CCS  components  are  based  on  RSD,  in  the 
past  we  disguised  the  complexity  of  the  RSD  language  by 
an  easy-to-use  command  line  interface.  There  was  no  need 
for  a  versatile  resource  description  facility  because  most  of 
the  systems  were  homogeneous,  their  topologies  simple  and 
regular,  and  nearly  all  applications  ran  on  only  one  system. 

With the trend towards metacomputing (now often called
grid-based computing), resource description has become more
and more important, because now the system (instead of the
user) decides which of the available resources are used.
Hence, the users need a convenient tool to specify their
requests, and the applications need an API to negotiate
their requirements with the resource management system.

The RSD system fulfills both requirements by providing the
RsdEditor and the RsdAPI, respectively.

2.3  Related  Work 

Like  RSD,  the  Globus  [9]  resource  specification  lan¬ 
guage  RSL  [10],  its  corresponding  metacomputer  directory 
service  MDS  [10]  and  the  underlying  LDAP  services  have 
also  been  devised  for  specifying  distributed  resources. 

However,  the  Globus  approach  is  somewhat  asymmet¬ 
ric:  it  uses  RSL  for  specifying  resource  requests  and  LDAP 
(based  on  X.500)  for  specifying  resource  offers. 

RSD, in contrast, uses the same representation for both
purposes, thereby allowing us to use common graph matching
mechanisms for brokerage.

The  brokerage  aspect  is  emphasized  in  the  ClassAds  ap¬ 
proach  used  in  the  Condor  [11]  framework  for  matching  re¬ 
source  offers  with  requests.  Compared  to  RSD,  the  Clas¬ 
sAds  project  focuses  on  protocols  for  advertising  resources 
and  on  the  matchmaking  process,  rather  than  on  the  spec¬ 
ification  aspect.  As  a  result,  the  expressions  used  in  the 
ClassAds  seem  to  be  less  powerful  than  our  hierarchical, 
graph-based  RSD  expressions. 

The  Resource  Cataloging  and  Distribution  System 
RCDS  [12]  developed  at  the  University  of  Tennessee  is  an¬ 
other  interesting  approach.  RCDS  supports  flexible,  scal¬ 
able,  and  secure  access  to  various  types  of  data  (e.g.  files) 
on  WAN  connected  computers. 

Resources are named by URNs (Uniform Resource Names), which
provide stable names for resources that may change in
content or location over time. This is achieved by putting
resolution servers between the location-dependent URLs and
the end user.

A  middle  software  layer  guarantees  integrity  and  persis¬ 
tence  of  resources  in  an  environment  of  dynamically  chang¬ 
ing  information. 

3.  RsdEditor:  Features  and  Implementation 

Figure  4  shows  the  RsdEditor  starting  window.  Cur¬ 
rently,  it  is  possible  to  choose  between  two  different  lan¬ 
guages,  English  and  Italian.  Moreover,  it  can  operate  in 
Administrator  or  User  mode. 


Figure  4.  RsdEditor  Start  Window 

Figure  5  depicts  an  example  of  a  working  session.  A 
status  bar  is  shown  at  the  bottom  of  the  window  in  which 
error  and  information  messages  are  displayed.  The  central 
part  of  the  window,  called  the  workspace,  is  the  working 
area  for  the  graphical  resource  specification. 


Figure  5.  Example  of  a  work  session 

The  menu  bar  contains  the  following  items:  File,  Op¬ 
tions,  Node,  Edge,  Preferences,  Tree,  Topologies,  and 
Help. 

File  enables  the  creation/editing  of  a  resource  specifica¬ 
tion  file. 

Options displays the current RSD file, refreshes the
workspace, etc.

Node  allows  the  creation,  editing,  or  deletion  of  a  node 
or  hypernode  (a  node  containing  other  nodes  in  a  recursive 
way).  In  the  RSD  syntax  a  node  represents  a  computational 
resource  characterized  by  graphical  and  RSD  attributes. 

Figures  6  and  7  show  the  definition  of  the  graphical  char¬ 
acteristics  and  the  assignment  of  RSD  attributes  to  a  node, 
respectively. 

The  RSD  syntax  requires  each  node  to  have  at  least  one 
port  (a  node’s  interface  toward  other  nodes)  in  order  to  link 
it  to  another  node  by  using  an  edge.  RSD  attributes  can  be 
assigned  to  a  port  (see  Figure  8). 

RSD allows nodes to be defined recursively and hypernodes to
be created. A hypernode contains the specifications of other
resources such as nodes, physical and virtual edges. On the
left hand side of Figure 5 the hypernodes and nodes are
indicated by the letters H and N, respectively.

Edge enables the creation, editing, or deletion of a
physical or virtual edge. A physical edge represents a link
between two nodes. The RSD syntax permits uni- or
bidirectional physical edges. By using the windows shown in
Figures 9 and 10 it is possible to select a port connected by
an edge (binding).

The edge binding is an RSD syntax constraint. Therefore, the
ports must be defined before the binding. The notion of a
virtual or vertical edge is used to provide a link between
different levels of a hierarchy in the specification graph.
A virtual edge is defined using the windows shown in
Figures 11 and 12, and it is represented by an arrow (see
Figure 5).

Figure 6. Specification of the graphical properties of a node

Figure 7. Specification of the RSD attributes of a node

Preferences  permits  the  definition  of  various  graphical 
features,  such  as  size,  shape,  color,  etc. 

Tree  enables  the  managing  of  the  resource  tree.  This  tree 
is  shown  in  a  synoptical  way;  it  is  thus  useful  for  the  user 
who  can  see  and  navigate  each  level  of  the  resource  specifi¬ 
cation  tree.  On  the  left  hand  side  of  Figure  5  an  example  of 
a  resource  specification  tree  is  shown. 

Topologies  allows  the  creation,  editing,  or  deletion 
of  nodes  representing  some  of  the  most  common  ho¬ 
mogeneous  interconnection  topologies  (Ring,  Grid,  Star, 
Torus).  Figure  13  shows  an  example  specifying  a  Grid 
composed  of  4  x  8  nodes.  This  prevents  the  user  from  hav¬ 
ing  to  manually  specify  32  nodes  and  52  edges. 
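
The figures quoted for the 4 x 8 Grid can be checked with a short calculation: a grid with r rows and c columns has r(c-1) horizontal and (r-1)c vertical edges. The helper below is a Python sketch written for this check; the function name grid_counts and the wraparound option for a torus are assumptions of this example, not part of RsdEditor.

```python
def grid_counts(rows, cols, wraparound=False):
    """Return (nodes, edges) for a rows x cols grid, or for a torus if wraparound."""
    nodes = rows * cols
    if wraparound:
        # Torus: every row and every column closes into a ring.
        # The formula assumes rows >= 3 and cols >= 3 so that no edge is doubled.
        edges = 2 * rows * cols
    else:
        edges = rows * (cols - 1) + (rows - 1) * cols
    return nodes, edges


print(grid_counts(4, 8))                   # (32, 52)
print(grid_counts(4, 8, wraparound=True))  # (32, 64)
```

For the 4 x 8 case this gives 4 * 7 + 3 * 8 = 52 edges, matching the number stated in the text.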

Help  accesses  the  on-line  manual. 

RsdEditor saves the current resource specifications by
creating two files, filename.rsd and filename.gui,
containing the resource specifications in RSD syntax and the
formal descriptions of graphical objects, respectively.
Here, filename is the name specified by the user when the
resource specification is created.

The  graphical  interface  provides  the  option  of  importing 
and  exporting  resource  specifications,  some  of  which  may 
have  been  previously  recorded,  in  order  to  reuse  them.  As 
shown  in  Figure  14  RsdEditor  allows  the  RSD  code,  pro¬ 
duced  during  a  specification  phase,  to  be  displayed. 


For  portability  reasons,  RsdEditor  was  implemented  in 
Java  [13,  14]  and  it  has  been  tested  successfully  on  Mi¬ 
crosoft  Windows  (98,  NT),  RedHat  Linux  and  Sun  Solaris. 
The  modular  structure  adopted  to  implement  RsdEditor  fa¬ 
cilitates  its  maintenance  and  extension. 

A  more  detailed  description  of  the  RsdEditor  functions 
can  be  found  in  [15,  16]. 

4.  Example  of  RsdEditor  Utilization 

As an example of how RsdEditor can be used, we show how to
describe the computing resources of a computing center.
Figure 15 shows the structure of the Paderborn Center for
Parallel Computing. There are four parallel computers
(CC_48, GCel_System, SCI_64 and GCPP_64) connected by
Ethernet, and a computer (Uranus) acting as gateway towards
the outside world.

As sketched in Figure 16, each parallel computer
(represented by a torus icon) and the gateway are connected
to a central node representing the Ethernet hub. Those nodes
are included in a hypernode denoted as Paderborn_Park. In
order  to  specify  the  attributes  of  each  computer  the  windows 
shown  in  Figures  13  need  to  be  used. 

The complete resource description automatically produced by
RsdEditor is shown in the following. The length of this
listing illustrates the usefulness of RsdEditor: writing the
resource specification by hand in the RSD language is a
relatively long and potentially error-prone task.

Figure 8. Specification of port attributes


Resource Description Produced by RsdEditor

ROOTNODE Paderborn_Park
{
  NODE Ethernet
  {
    PORT Eth1 { Type = Ethernet; };
    PORT Eth2 { Type = Ethernet; };
    PORT Eth3 { Type = Ethernet; };
    PORT Eth4 { Type = Ethernet; };
    PORT Eth5 { Type = Ethernet; };
    PORT Eth6 { Type = Ethernet; };
    Bandwidth = 100;
  };

  NODE Uranus
  {
    PORT Ethernet;
    PORT ATM;
  };

  NODE CC_48
  {
    CONST n = 2;
    CONST m = 24;
    FOR i=1 TO n DO
      FOR j=1 TO m DO
        NODE Torus_$i_$j
        {
          PORT cc48;
          IF ((i=1) && (j=1)) THEN
            PORT CC_48-Esterna;
          FI
          CPU = PowerPC;
          Memory = 64MByte;
          PeakPerformance = 12.76GFlops;
          SysOp = AIX4.1;
        };
      OD
    OD
    FOR i=1 TO n-1 DO
      FOR j=1 TO m DO
        EDGE Edge_$i_$j_to_$i_$((j+1) MOD m)
        {
          NODE Torus_$i_$j PORT cc48 <=>
          NODE Torus_$i_$((j+1) MOD m) PORT cc48;
        };
      OD
    OD
    FOR j=1 TO m-1 DO
      FOR i=1 TO n DO
        EDGE Edge_$i_$j_to_$((i+1) MOD n)_$j
        {
          NODE Torus_$i_$j PORT cc48 <=>
          NODE Torus_$((i+1) MOD n)_$j PORT cc48;
        };
      OD
    OD
    ASSIGN Torus_1_1 PORT CC_48-Esterna;
  };

Figure 9. Specification of the RSD attributes of a physical edge

  NODE GCel_System
  {
    CONST n = 32;
    CONST m = 32;
    FOR i=1 TO n DO
      FOR j=1 TO m DO
        NODE Torus_$i_$j
        {
          PORT TransputerLink;
          IF ((i=1) && (j=1)) THEN
            PORT GCel-Esterna;
          FI
          CPU = T805;
          Memory = 4MByte;
          PeakPerformance = 4.4GFlops;
        };
      OD
    OD
    FOR i=1 TO n-1 DO
      FOR j=1 TO m DO
        EDGE Edge_$i_$j_to_$i_$((j+1) MOD m)
        {
          NODE Torus_$i_$j PORT TransputerLink <=>
          NODE Torus_$i_$((j+1) MOD m) PORT TransputerLink;
        };
      OD
    OD
    FOR j=1 TO m-1 DO
      FOR i=1 TO n DO
        EDGE Edge_$i_$j_to_$((i+1) MOD n)_$j
        {
          NODE Torus_$i_$j PORT TransputerLink <=>
          NODE Torus_$((i+1) MOD n)_$j PORT TransputerLink;
        };
      OD
    OD
    ASSIGN Torus_1_1 PORT GCel-Esterna;
  };

Figure 10. Specification of the RSD attributes of a physical edge

  NODE GCPP_64
  {
    CONST n = 4;
    CONST m = 8;
    FOR i=1 TO n DO
      FOR j=1 TO m DO
        NODE Torus_$i_$j
        {
          PORT gcpp;
          IF ((i=1) && (j=1)) THEN
            PORT GCPP-Esterna;
          FI
          CPU = PowerPC601;
          CPUNumber = 2;
          Memory = 32MByte;
          PeakPerformance = 5.12GFlops;
        };
      OD
    OD
    FOR i=1 TO n-1 DO
      FOR j=1 TO m DO
        EDGE Edge_$i_$j_to_$i_$((j+1) MOD m)
        {
          NODE Torus_$i_$j PORT gcpp <=>
          NODE Torus_$i_$((j+1) MOD m) PORT gcpp;
        };
      OD
    OD
    FOR j=1 TO m-1 DO
      FOR i=1 TO n DO
        EDGE Edge_$i_$j_to_$((i+1) MOD n)_$j
        {
          NODE Torus_$i_$j PORT gcpp <=>
          NODE Torus_$((i+1) MOD n)_$j PORT gcpp;
        };
      OD
    OD
    ASSIGN Torus_1_1 PORT GCPP-Esterna;
  };

Figure 11. Specification of the RSD attributes of a virtual edge

  NODE SCI_64
  {
    CONST n = 4;
    CONST m = 8;
    FOR i=1 TO n DO
      FOR j=1 TO m DO
        NODE Torus_$i_$j
        {
          PORT sci64;
          IF ((i=1) && (j=1)) THEN
            PORT SCI_64-Esterna;
          FI
          CPU = PentiumII;
          CPUNumber = 2;
          Memory = 256MByte;
          PeakPerformance = 19.2GFlops;
        };
      OD
    OD
    FOR i=1 TO n-1 DO
      FOR j=1 TO m DO
        EDGE Edge_$i_$j_to_$i_$((j+1) MOD m)
        {
          NODE Torus_$i_$j PORT sci64 <=>
          NODE Torus_$i_$((j+1) MOD m) PORT sci64;
          Bandwidth = 500MByte/s;
        };
      OD
    OD
    FOR j=1 TO m-1 DO
      FOR i=1 TO n DO
        EDGE Edge_$i_$j_to_$((i+1) MOD n)_$j
        {
          NODE Torus_$i_$j PORT sci64 <=>
          NODE Torus_$((i+1) MOD n)_$j PORT sci64;
          Bandwidth = 500MByte/s;
        };
      OD
    OD
    ASSIGN Torus_1_1 PORT SCI_64-Esterna;
  };

Figure 12. Specification of the RSD attributes of a virtual edge

Figure 13. Supported topologies


  {
    NODE Uranus PORT Ethernet <=> NODE Ethernet PORT Eth1;
    Type = Ethernet;
  };

  EDGE Edge1
  {
    NODE Ethernet PORT Eth2 <=>
    NODE GCel_System PORT GCel-Esterna;
    Type = Ethernet;
  };

  EDGE Edge2
  {
    NODE Ethernet PORT Eth3 <=> NODE SCI_64 PORT SCI_64-Esterna;
    Type = Ethernet;
  };

  EDGE Edge3
  {
    NODE Ethernet PORT Eth4 <=> NODE GCPP_64 PORT GCPP-Esterna;
    Type = Ethernet;
  };

  EDGE Edge4
  {
    NODE Ethernet PORT Eth5 <=> NODE CC_48 PORT CC_48-Esterna;
    Type = Ethernet;
  };

  ASSIGN NODE Uranus PORT ATM;
}
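
The FOR ... DO ... OD constructs in the listing generate one NODE declaration per index pair. As a rough illustration in Python rather than RSD, and assuming the nested loops run over i = 1..n and j = 1..m (as the node name Torus_$i_$j suggests), the expansion of the node names can be sketched as follows. The function name expand_torus_nodes is hypothetical and not part of the RSD tools; the (n, m) constants are the CONST values from the listing.

```python
def expand_torus_nodes(n, m, prefix="Torus"):
    """Expand the nested FOR loops into the node names Torus_<i>_<j>."""
    return [f"{prefix}_{i}_{j}" for i in range(1, n + 1) for j in range(1, m + 1)]


# CONST n and CONST m as declared for each parallel computer in the listing.
machines = {
    "CC_48": (2, 24),
    "GCel_System": (32, 32),
    "GCPP_64": (4, 8),
    "SCI_64": (4, 8),
}

node_counts = {name: len(expand_torus_nodes(n, m)) for name, (n, m) in machines.items()}
cc48_names = expand_torus_nodes(*machines["CC_48"])
```

The few lines of loop syntax for GCel_System thus stand for 1024 individual node declarations, which is the kind of repetition that the Topologies menu of RsdEditor hides from the user.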


5.  Summary 

In this paper we have presented RsdEditor, a graphical
editor for specifying computational resources and services
in distributed environments. Computing components
(computers, processors) are represented by nodes and their
interconnects (WAN, LAN, or internal computer links) by
edges. Both may be attributed.

Figure 14. RSD code generation

In contrast to other approaches, RSD is used by both users
and administrators, thereby allowing the use of simple graph
matching algorithms for mapping resource requests onto
resource offers.

RsdEditor currently generates (via the RSD language) C++
objects. For improved portability, we plan to adapt
RsdEditor to generate XML code. In addition, work is under
way to implement resource brokers with different strategies
on top of the RSD framework.

Abstract

The Computational Grid provides a promising platform for
the efficient execution of parameter sweep applications
over very large parameter spaces. Scheduling such
applications is challenging because target resources are
heterogeneous, because their load and availability vary
dynamically, and because independent tasks may share common
data files. In this paper, we propose an adaptive scheduling
algorithm for parameter sweep applications on the Grid. We
modify standard heuristics for task/host assignment in
perfectly predictable environments (Max-min, Min-min,
Sufferage), and we propose an extension of Sufferage called
XSufferage. Using simulation, we demonstrate that XSufferage
can take advantage of file sharing to achieve better
performance than the other heuristics. We also study the
impact of inaccurate performance prediction on scheduling.
Our study shows that: (i) different heuristics behave
differently when predictions are inaccurate; (ii) increased
adaptivity leads to better performance.


1.  Introduction 

Fast networks make it possible to aggregate CPU, network
and storage resources into Computational Grids [8]. Such
environments can be used effectively to support very
large-scale runs of distributed applications. An ideal class
of applications for the Grid is the class of parameter sweep
applications, applications structured as a set of multiple
"experiments", each of which is executed with a distinct set
of parameters.

This research was supported in part by NSF Grant
ASC-9701333, NASA/NPACI Contract AD-435-5790, DARPA/ITO
under contract #N66001-97-C-8531, and CNRS/INRIA project
ReMaP.

Executing  a  parameter  sweep  on  the  Grid  involves 
the  assignment  of  tasks  to  resources.  Although  the 
experiments  (or  tasks)  of  a  parameter  sweep  applica¬ 
tion  are  independent,  a  number  of  issues  make  schedul¬ 
ing  such  applications  challenging.  First,  resources  in 
the  Grid  are  typically  shared  so  that  the  contention 
created  by  multiple  applications  creates  dynamically 
fluctuating  delays  and  qualities  of  service.  In  addi¬ 
tion,  Grid  resources  are  heterogeneous  and  may  not 
perform  similarly  for  the  same  application.  Moreover, 
although  parameter  sweep  tasks  are  independent,  they 
may  share  common  input  files  which  reside  at  remote 
locations,  hence  the  performance-efficient  assignment 
and  scheduling  of  the  application  must  include  con¬ 
sideration  of  the  impact  of  data  transfer  times.  Previ¬ 
ous  work  [3]  has  demonstrated  that  run-time,  adaptive, 
application-scheduling  based  on  dynamic  information 
about  the  status  of  computing  resources  is  a  good  gen¬ 
eral  approach  for  achieving  performance  on  the  Grid. 

In [20], three heuristics (Max-min, Min-min and Sufferage)
were proposed for the scheduling of independent tasks in
single-user, homogeneous environments. In this paper, we
modify existing heuristics to schedule parameter sweep
applications with file I/O requirements, we propose an
extended version of Sufferage, XSufferage, and we study the
impact of inaccurate performance prediction on scheduling.
We integrate these heuristics into a general adaptive
scheduling algorithm and compare them in various simulated
computing environments and for various application
scenarios. We use a standard performance metric to evaluate
our heuristics: the application makespan [22], i.e. the time
between the moment the first input file is sent to a
computational server and the moment the last output file is
returned to the user. Our ultimate goal is to include our
adaptive scheduling algorithm in a software framework, a
Parameter Sweep Template (PST), developed as part of the
AppLeS project [13]. PST will be the subject of a future
paper.
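
The makespan defined above is easy to state in code. The following Python sketch assumes that, for each task, we have recorded the time at which its first input file was sent and the time at which its output file came back; the function name makespan and the sample timings are invented for illustration.

```python
def makespan(events):
    """events: iterable of (input_sent_time, output_returned_time), one pair per task."""
    events = list(events)
    first_sent = min(sent for sent, _ in events)
    last_returned = max(returned for _, returned in events)
    return last_returned - first_sent


# Hypothetical timings in seconds for three tasks.
runs = [(0.0, 40.0), (5.0, 55.0), (12.0, 48.5)]
print(makespan(runs))  # 55.0
```

Note that the metric depends only on the earliest send and the latest return, so a single slow task determines the makespan regardless of how quickly the others finish.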

In  a  Grid  environment  it  is  usually  difficult  to  obtain 
accurate  predictions  for  computing  and  networking  re¬ 
source  performance;  moreover  most  scheduling  heuris¬ 
tics  make  use  of  such  predictions.  We  designed  a  simu¬ 
lator  that  allows  us  to  experiment  with  different  levels 
of  performance  prediction  accuracy.  In  this  paper  we 
present  a  preliminary  study  of  the  effect  of  increasing 
inaccuracy  on  the  heuristics  under  consideration  and 
discuss  how  adaptivity  can  be  used  to  promote  perfor¬ 
mance  in  Grid  environments. 

This paper is organized as follows. In Section 2, we present
our models for both the application and the underlying Grid
environment. In Section 3, we present our scheduling
algorithm. Section 4 focuses on the different task/host
assignment heuristics, whereas Section 5 discusses
adaptivity and performance prediction accuracy. Section 7
references related research work, and Section 8 concludes
the paper.

2.  A  Scheduling  Model  for  Parameter 
Sweeps  on  the  Grid 

2.1.  Application  Model 

We define a parameter sweep application as a set of n
independent sequential tasks $\{T_i\}_{i=1}^{n}$. By
independent we mean that there are no inter-task
communications or data dependencies (i.e. task precedences).
We assume that the input to each task is a set of files and
that a single file might be input to more than one task. In
our model, without loss of generality, each task produces
exactly one output file. Figure 1 shows an example with
input file sharing among tasks. We assume that the size of
each input and output file is known a priori.
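
The application model can be written down as a small data structure. The Python sketch below is an illustration only: the Task class, the file names, and the sizes are made up for this example and do not come from the paper or from MCell.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    inputs: frozenset  # names of the input files this task reads
    output: str        # each task produces exactly one output file


# File sizes (in MBytes) are assumed to be known in advance.
file_sizes = {"geomA": 500, "geomB": 300, "mol1": 20, "mol2": 25,
              "out1": 50, "out2": 50, "out3": 50}

tasks = [
    Task("T1", frozenset({"geomA", "mol1"}), "out1"),
    Task("T2", frozenset({"geomA", "mol2"}), "out2"),
    Task("T3", frozenset({"geomB", "mol1"}), "out3"),
]


def shared_inputs(tasks):
    """Input files that are read by more than one task."""
    usage = Counter(f for t in tasks for f in t.inputs)
    return {f for f, count in usage.items() if count > 1}


shared = shared_inputs(tasks)
```

Identifying the shared files is the starting point for any heuristic that tries to place tasks where their inputs already reside.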

This model is motivated by our primary target application
for PST: MCell [29], a micro-physiology application that
uses 3-D Monte-Carlo simulation techniques to study
molecular bio-chemical interactions within living cells. An
MCell run is composed of multiple Monte-Carlo simulations
for cell regions whose geometries are described in
(potentially very large) files. For instance, MCell can be
used to study the trajectories of neurotransmitters in the
3-D space between two cell membranes for different
deformations of the membranes, where each deformation is
described in a geometry file. Additional files of variable
sizes are also needed for describing the initial locations
of different molecules. The model described above is
adequate for our purpose and should be general enough to
accommodate other applications (e.g. general Monte-Carlo
simulations).

Figure 1. Application Model

MCell users and developers anticipate large-scale runs that
contain tens of thousands of tasks, with each task
processing hundreds of MBytes of input and output data, and
with various task-file usage patterns. Furthermore, research
work outside the scope of this paper addresses the question
of steerable MCell runs, in which users can add new tasks
on-the-fly and modify the computational targets of existing
tasks. Such runs will lead to fairly intricate task-file
usage patterns, and a model as general as the one we
describe will be needed to study scheduling issues in the
presence of computational steering.

2.2.  Grid  Model 

We assume that the Computational Grid available to the user
has the following topology: it is a set of k clusters of
computing resources that are accessible to the user via k
distinct network links. This is a logical topology, and this
work does not attempt to take into account the actual
physical network topology of the Grid. Our intent is to
model a wide-area system, such as a Worldwide Flock of
Condors [24] for instance. Each cluster contains a certain
number of hosts where a host can be any computing platform,
from a single-processor workstation to an MPP system, and is
available for computation. From now on, we call both hosts
and network links resources. We do not impose any constraint
on the performance characteristics of the resources and our
simulator allows for arbitrary performance variation. The
only requirement is that for each computation and file
transfer, an estimate of running time is available. On
interactive hosts, the estimate is the task execution time
whereas on batch resources, it is the turnaround time
(defined as the waiting time plus the execution time). Such
estimates can be provided by the user, computed from
analytical models or historical information, or provided by
facilities such as the Network Weather Service (NWS) [31],
ENV [28], Remos [19], Grid services such as those found in
Globus [10], or computed from a combination of the above.
Recent work shows that the accuracy of the performance
estimates has an impact on the effectiveness of scheduling
heuristics and that information about the probability
distribution of the estimates should be exploited for
scheduling [18, 26]. We explore scenarios for different
levels of estimate accuracy in Section 5.

Figure 2. Environment Models
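
The distinction between interactive and batch resources amounts to a one-line rule for the running-time estimate. The Python helper below is a sketch of that rule only; its name and parameters are assumptions, and it says nothing about how the individual estimates are obtained.

```python
def running_time_estimate(execution_s, waiting_s=0.0, batch=False):
    """Estimate used by the scheduler for one task on one host.

    Interactive host: the task execution time.
    Batch resource:   the turnaround time, i.e. waiting time plus execution time.
    """
    if batch:
        return waiting_s + execution_s
    return execution_s


interactive = running_time_estimate(120.0)
turnaround = running_time_estimate(120.0, waiting_s=300.0, batch=True)
```

With these example numbers, a batch queue wait of 300 s makes the same 120 s task look 3.5 times as expensive to the scheduler as it would on an interactive host.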

We assume that a storage facility (e.g. NFS, GASS [9],
IBP [23]) is available at each cluster so that files can be
shared among the processes running on different hosts in the
cluster. For the first implementation of PST, we are
planning to use IBP for storage management. Figure 2 shows
an example of a Grid with three clusters. In this work we
assume that all input files are initially stored on the
user's host, that all output files must be returned to this
location, and that there are no inter-cluster file
exchanges. We assume for now that once assigned, tasks do
not migrate between resources. This scenario fits the
current usage of several real-life parameter sweep
applications (e.g. MCell, INS2D [25]), and we leave
alternate usage scenarios for future work. In this work we
ignore possible storage constraints and assume unlimited
storage space. Our model assumptions are discussed in the
following section, but we believe that they make it possible
to obtain initial meaningful results about a realistic
environment while keeping the simulation tractable.
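
One consequence of per-cluster shared storage is that an input file already present at a cluster need not be sent again for a later task. The Python sketch below illustrates this under simplifying assumptions that are ours, not the paper's: a single constant link bandwidth per cluster, sequential transfers, and hypothetical function names and figures.

```python
def missing_inputs(task_inputs, cluster_store):
    """Input files that still have to be sent from the user's host to the cluster."""
    return set(task_inputs) - set(cluster_store)


def estimated_completion(task_inputs, sizes_mb, cluster_store, link_mb_per_s, compute_s):
    """Transfer time for the missing files plus the estimated running time."""
    to_send = missing_inputs(task_inputs, cluster_store)
    transfer_s = sum(sizes_mb[f] for f in to_send) / link_mb_per_s
    return transfer_s + compute_s


sizes = {"geom": 400.0, "mol": 20.0}
cold = estimated_completion({"geom", "mol"}, sizes, set(), 10.0, 60.0)
warm = estimated_completion({"geom", "mol"}, sizes, {"geom"}, 10.0, 60.0)
```

In this example the estimate drops from 102 s to 62 s once the large geometry file is already stored at the cluster, which is the effect that file-sharing-aware heuristics try to exploit.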

2.3.  Model  Discussion 

Our  Grid  model  makes  several  simplifying  assump¬ 
tions.  Even  though  we  allow  network  links  to  have  ar¬ 
bitrary  dynamic  performance  characteristics,  we  do  not 
model  network  contention  caused  by  the  application  it¬ 
self.  Instead  we  view  the  network  as  a  set  of  distinct 
links  emanating  from  the  user’s  host  and  that  can  all  be 
used  in  parallel.  We  believe  that  this  assumption  will 
need  to  be  relaxed  in  future  work.  Since  our  purpose  is 
not  simulation  per  se,  we  will  aim  at  using  simulators 
developed  by  other  research  groups.  For  instance,  the 
Micro-Grids  simulator  [15],  when  it  becomes  available, 
will  allow  us  to  precisely  simulate  network  contention 
and  study  its  impact  on  our  current  results. 

Similarly,  we  do  not  take  into  account  contention 
within a cluster for shared file access. Our justification
is that wide-area file transfer costs dominate the
cost of file access within the cluster, even in the
presence of contention. While this is true in certain
environments, it is certainly not general and we will need
to enhance our own simulator so that it can
model contention for shared storage. For instance, this
will  be  necessary  to  simulate  high-bandwidth  wide-area 
research  networks  such  as  the  vBNS.  At  the  moment 
we  are  planning  to  deploy  the  PST  software  on  non- 
dedicated  commercial  wide-area  networks  with  many 
clusters  and  we  believe  our  simulation  results  will  hold 
in  those  environments.  The  assumption  of  unlimited 
storage  is  realistic  for  current  runs  of  MCell  on  our 
current  testbed,  but  that  assumption  will  be  relaxed  in 
future  versions  of  our  scheduling  algorithm. 

Our  Grid  model  also  assumes  that  there  are  no  di¬ 
rect  network  links  between  clusters  in  the  sense  that 
file  transfers  cannot  be  performed  by  our  scheduling 
algorithm  between  clusters.  In  other  words,  the  only 
authoritative  source  of  input  files  is  the  user’s  host. 
This  prevents  schedulers  from  making  some  optimiza¬ 
tions  when  disseminating  input  files  among  the  clus¬ 
ters.  However,  no  heuristic  we  study  in  this  paper  is 
able  to  support  such  optimizations  as  this  would  re¬ 
quire  a  considerably  more  precise  understanding  of  the 
network.  Our  next  step  in  this  research  will  be  to  use 
a  more  complete  network  model  and  to  consider  any 
storage  device  for  any  file  retrieval.  This  will  allow  not 
only  for  more  flexible  application  scenarios,  but  also 
for  the  investigation  of  more  sophisticated  scheduling 
algorithms. 

schedule() {
  (1) compute the next scheduling event
  (2) create a Gantt Chart, G
  (3) foreach computation and file transfer currently underway
        compute an estimate of its completion time
        fill in the corresponding slots in G
  (4) select a subset of the tasks that have not started execution: T
  (5) until each host has been assigned enough work
        heuristically assign tasks to hosts (filling slots in G)
  (6) convert G into a plan
}

Figure 3. Scheduling Algorithm Skeleton


3.  Adaptive  Scheduling  for 
Parameter  Sweeps 

3.1.  The  Scheduling  Algorithm 

We call our scheduling algorithm schedule(). The
general strategy is that it takes into account resource
performance estimates to generate a plan for assigning
file transfers to network links and tasks to hosts.
To account for the Grid’s dynamic nature, schedule()
can be called repeatedly so that the schedule can be
modified and refined. We denote the points in time
at which schedule() is called scheduling events,
according to the terminology in [20]. We assume that at
each scheduling event our scheduler has knowledge of:
(i) the current topology of the Grid (number of clusters,
number of hosts in those clusters, network and
CPU loads), (ii) the number and location of copies of
all input files, and (iii) the list of computations and file
transfers currently underway or already completed.

Figure 3 shows the general skeleton for schedule(),
whose steps can be described as follows:

(1) determines the time of the next scheduling event.
This can take into account environment behavior
to increase or decrease the scheduling event frequency.
A higher frequency means a higher adaptivity
but also a higher scheduling cost.

(2) creates a Gantt chart [7], G, that will be used to
keep track of task/host assignments. G contains
as many columns as resources. Figure 4 shows
an example of a Gantt chart for an environment
containing two clusters with respectively two and
three hosts.

(3) inserts slots corresponding to computations and file
transfers that are currently underway into the chart.
Two examples are shown in Figure 4 as black-filled
rectangular slots at the beginning of the chart (one
file transfer and one computation).

(4) performs a task-space reduction that can be used
to reduce schedule()’s execution time. This will
be necessary for runs of real parameter sweep
applications since we expect them to contain
thousands of tasks.

(5) is the core of the algorithm, determining which
task should be performed on which host. This
step is detailed in Section 4. Examples of slot
assignments are depicted in Figure 4 in gray. In this
example, input file transfers are scheduled on the
network link to cluster 2, the computation is then
scheduled on a host within that cluster, and the
output file is scheduled to be returned to the user’s
host.

(6) converts the Gantt chart into a plan, or a sequence
of instructions. These instructions can then provide
a schedule for deployment with Grid software
services (for job submission and monitoring, data
motion, etc.).
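
The six steps above can be sketched in a few lines of Python. This is only a minimal illustration under simplifying assumptions, not the PST implementation: the names Slot, GanttChart and greedy_assign are hypothetical, file transfers are treated like any other slot, the task-space reduction of step (4) is a plain truncation, and step (5) is replaced by a trivial earliest-free-resource rule with an assumed 20% margin past the next scheduling event.

```python
from dataclasses import dataclass, field


@dataclass
class Slot:
    label: str    # identifier of a computation or file transfer
    start: float  # seconds
    end: float    # seconds


@dataclass
class GanttChart:
    columns: dict = field(default_factory=dict)  # resource -> list of Slot

    def add(self, resource, slot):
        self.columns.setdefault(resource, []).append(slot)

    def free_at(self, resource, now):
        """Earliest time at which the resource has no slot booked."""
        return max([now] + [s.end for s in self.columns.get(resource, [])])


def greedy_assign(chart, tasks, now, next_event, margin=1.2):
    """Hypothetical stand-in for step (5): give each task to the resource
    that frees up first, until every resource is busy past the horizon."""
    horizon = now + margin * (next_event - now)
    for label, duration in tasks:
        resource = min(chart.columns, key=lambda r: chart.free_at(r, now))
        start = chart.free_at(resource, now)
        if start >= horizon:
            break  # every resource already has enough work
        chart.add(resource, Slot(label, start, start + duration))


def schedule(now, period, resources, running, pending,
             assign=greedy_assign, max_tasks=None):
    next_event = now + period                           # step (1)
    chart = GanttChart({r: [] for r in resources})      # step (2)
    for resource, label, estimated_end in running:      # step (3)
        chart.add(resource, Slot(label, now, estimated_end))
    subset = list(pending)[:max_tasks]                  # step (4)
    assign(chart, subset, now, next_event)              # step (5)
    plan = {r: [s.label for s in sorted(slots, key=lambda s: s.start)]
            for r, slots in chart.columns.items()}      # step (6)
    return next_event, plan


next_event, plan = schedule(
    now=0.0, period=300.0, resources=["h1", "h2"],
    running=[("h1", "t0", 100.0)],
    pending=[(f"t{i}", 200.0) for i in range(1, 7)])
print(next_event, plan)
```

Because the assignment rule is passed in as a parameter, the heuristics of Section 4 could be substituted for greedy_assign without touching the other steps.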

3.2.  Discussion 

Several  steps  of  our  scheduling  algorithm  can  be  im¬ 
plemented  independently  and  this  makes  it  possible 
to  experiment  with  different  techniques  and  strategies. 
Our  ultimate  goal  is  to  instantiate  the  algorithm  so 
that it is optimized for specific environments and
applications. Furthermore, this instantiation should be
as dynamic and automatic as possible, as the algorithm
should be able to reconfigure itself on-the-fly to
accommodate changing Grid conditions.

Figure 4. Sample Gantt chart

Step (1) allows for dynamic adjustment of the
scheduling event frequency. A higher frequency leads to
more adaptivity and should be better for unstable Grid
environments. However, a higher frequency also means
that schedule() is called more frequently. Depending
on the computational cost of the scheduling, a high
frequency might not be desirable. In the case of the PST
software, a processor is usually dedicated to scheduling.
Furthermore, given the granularity of the applications
we are considering, there should be no need for very
high frequencies. However, steps (3) and (5) might
include remote access to Grid information services (e.g.
NWS [31]) for performance prediction. It may
then become necessary to reduce the scheduling event
frequency because of latencies associated with Grid
services. One can envision algorithms to dynamically tune
the frequency in step (1). For instance, one could
compute the deviation of the computation from what was
planned during the previous call to schedule(). Large
deviations suggest higher frequencies whereas low
deviations suggest that the frequency can be decreased.
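
One possible way to turn this idea into code is sketched below. It is an illustration only: the function name tune_period, the thresholds (5% and 20%), the halving/doubling rule and the period bounds are all assumptions made for the example and do not come from the PST software.

```python
def tune_period(period, planned, observed, low=0.05, high=0.20,
                min_period=60.0, max_period=3600.0):
    """Return a new interval (seconds) between scheduling events.

    planned / observed map a task name to its planned / observed
    completion time. All thresholds are illustrative assumptions.
    """
    common = [t for t in planned if t in observed]
    if not common:
        return period  # nothing to compare against yet
    # Mean relative deviation from the previous plan.
    deviation = sum(abs(observed[t] - planned[t]) / planned[t]
                    for t in common) / len(common)
    if deviation > high:
        period /= 2   # large deviation: schedule more often
    elif deviation < low:
        period *= 2   # plan was accurate: schedule less often
    return min(max(period, min_period), max_period)


planned = {"a": 100.0, "b": 200.0}
print(tune_period(500.0, planned, {"a": 150.0, "b": 300.0}))
print(tune_period(500.0, planned, {"a": 100.0, "b": 200.0}))
```

The lower bound on the period reflects the remark above that latencies of Grid information services may make very frequent scheduling events impractical.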

Step (3) obtains estimates for completion of ongoing
file transfers and computations in order to start filling
in slots in the Gantt chart. This is a little different
from estimating just running times because more
information is available. It is indeed conceivable that more
precise forecasting techniques can be used because the
required prediction is in the near future and because
there are ways to compute percentage to completion.
It may be that applications provide means to check on
computation progress (this is however not the case for
MCell). More generally, techniques using historical
information from Grid services can lead to estimations of
the percentage to completion. We have started
experimenting with such techniques and will present results
in a subsequent paper.

Step (5) in Figure 3 states that tasks are assigned
to hosts until “enough” work has been assigned. Like
step  (4),  this  is  intended  to  limit  the  time  spent  com¬ 
puting  the  schedule.  Indeed,  it  makes  little  sense  to 
assign  tasks  to  hosts  for  times  that  are  well  beyond 
the  next  scheduling  event  since  the  schedule  will  be  re¬ 
evaluated  then.  Since  real  runs  will  not  be  perfectly 
predictable,  it  is  good  practice  to  leave  some  margin  of 
error  and  assign  work  until  after  the  next  scheduling 
event  so  that  resources  are  utilized  even  if  the  perfor¬ 
mance  predictions  were  pessimistic. 

Step (6) processes the Gantt chart and transforms it
into a set of task lists, one associated with each resource.
The Grid infrastructure software in use is then responsible
for sequencing file transfers and computations on the
appropriate resources. Here there is some latitude
concerning the actual implementation of
the task sequencing. It may be that, due to unexpected
performance  misprediction,  some  resource  cannot  exe¬ 
cute  the  next  task  on  its  list  but  could  execute  one 
of  the  subsequent  ones.  For  instance,  a  file  transfer  for 
the  output  of  a  task  that  is  unexpectedly  lagging  might 
cause  a  network  link  to  stay  unused.  A  solution  is  to 
relax  the  ordering  of  the  list  and  allow  subsequent  file 
transfers  to  be  performed  immediately. 

Our  experience  indicates  that  allowing  output  file 
transfers  to  be  delayed  until  they  can  effectively  occur 
is  usually  a  good  idea  as  it  allows  for  better  network 
bandwidth  utilization  while  not  disrupting  the  overall 
schedule  to  a  great  extent.  This  is  the  scheme  used  by 
schedule() in this paper. Further experiments would
be  required  in  order  to  investigate  the  trade-offs  be¬ 
tween  resource  utilization  and  schedule  disruption. 

Steps (1) and (5) use dynamic information about
the status of the Grid resources and are key to the
algorithm’s efficacy. Our main focus in this paper is
step (5) of the algorithm and our results are
presented in the following section. We also present
preliminary experiments concerning step (1) in Section 5.

4.  Performing  Task/Host  Assignment 
Decisions 

4.1.  Heuristics 

We  must  identify  heuristics  that  are  applicable  in 
Grid  environments  to  perform  assignment  of  file  trans¬ 
fers  to  network  links  and  of  computations  to  hosts. 
Moreover, these heuristics must be reasonably
computationally inexpensive with respect to the duration of
a typical application task. Three simple heuristics for
scheduling independent tasks for a uniform single-user
environment are proposed in [17, 20]: Min-min,
Max-min, and Sufferage. These three heuristics iteratively
assign tasks to processors by considering all tasks not
yet scheduled and computing Minimum Completion Times
(MCTs). For each task, this is done by tentatively
scheduling it to each resource, estimating the task’s
completion time, and computing the minimum
completion time over all resources. For each task, a metric
is computed using these MCTs, and the task with the
“best” metric is assigned to the resource that lets it
achieve its MCT. The process is then repeated until all
tasks have been scheduled.

Min-min uses the minimum MCT as a metric, meaning
that the task that can be completed the earliest
is given priority. The motivation behind Min-min is
that assigning tasks to the hosts that will execute them
the fastest will lead to an overall reduced makespan.
Max-min’s metric is the maximum MCT. The expectation
is to overlap long-running tasks with short-running
ones. The rationale behind Sufferage is that a host
should be assigned to the task that would “suffer” the
most if not assigned to that host. For each task, its
sufferage value is defined as the difference between its
best MCT and its second-best MCT. Tasks with a high
sufferage value take precedence. Note that this
definition of sufferage is a little different from the one
presented in [20]. We found our definition easier to
implement and experiments showed no differences
between our version of sufferage and the one in [20]. We
modified all three heuristics so that they (i) include
input and output data transfer times when computing
MCTs and (ii) take into account the fact that some files
may already be present on remote storage devices. In
addition, we implemented an extended version of the
Sufferage heuristic: XSufferage.
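
The common structure of the three heuristics can be illustrated with a short Python sketch. It is deliberately simplified and is not the authors' code: completion time is taken to be the host's ready time plus an assumed per-host execution time, so the file transfer terms of modifications (i) and (ii) are left out, and all names (mct_schedule, min_min, max_min, sufferage) are hypothetical.

```python
def mct_schedule(exec_time, metric):
    """Iteratively assign tasks to hosts.

    exec_time[task][host] is an estimated run time; every task is
    assumed to have an entry for every host. metric(mct, second_mct)
    returns a score and the task with the highest score goes first.
    Returns a list of (task, host, completion_time).
    """
    hosts = sorted({h for times in exec_time.values() for h in times})
    ready = {h: 0.0 for h in hosts}
    unscheduled = set(exec_time)
    assignments = []
    while unscheduled:
        best = None
        for task in sorted(unscheduled):
            # Tentatively schedule the task on every host.
            completion = sorted((ready[h] + exec_time[task][h], h)
                                for h in hosts)
            mct, host = completion[0]
            second = completion[1][0] if len(completion) > 1 else mct
            score = metric(mct, second)
            if best is None or score > best[0]:
                best = (score, task, host, mct)
        _, task, host, mct = best
        ready[host] = mct
        unscheduled.remove(task)
        assignments.append((task, host, mct))
    return assignments


def min_min(mct, second):
    return -mct            # smallest MCT has priority


def max_min(mct, second):
    return mct             # largest MCT has priority


def sufferage(mct, second):
    return second - mct    # best vs. second-best MCT


times = {"a": {"h1": 10, "h2": 20},
         "b": {"h1": 30, "h2": 35},
         "c": {"h1": 5, "h2": 50}}
for metric in (min_min, max_min, sufferage):
    print(metric.__name__, mct_schedule(times, metric))
```

Only the metric differs between the three heuristics; the tentative scheduling loop and the update of host ready times are shared.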

In XSufferage the sufferage value is computed not
with MCTs, but with cluster-level MCTs, i.e. by
computing the minimum MCTs over all hosts in each
cluster. Our first intuition was that Sufferage should be a
nice way to exploit file locality issues without any
a-priori analysis of the task-file dependence pattern. The
idea is that if a file required by some task is already
present at a remote cluster, that task would “suffer”
if not assigned to a host in that cluster, provided the
file is large compared to the available bandwidth on
the cluster’s network link. The sufferage value would
then be a simple way of capturing such situations and
ensuring maximum file re-use. This is somewhat
reminiscent of the idea of task/host affinities introduced
in [20], where some hosts are better for some tasks but
not for others.

However, early experiments showed that the
Sufferage heuristic as described above does not lead
to makespans as good as the ones we expected. This
can be explained easily. Assume that a task, say T0,
requires a large input file that is already stored on a
remote cluster. If that cluster contains two (or more)
hosts with nearly identical performance, which is often
the case in practice, then both those hosts can achieve
nearly the same MCT for that task. If the file is of
significant size compared to the network bandwidths
available, then it is likely that those two hosts lead to the
best and second-best MCTs for T0. This means that
the sufferage value will be close to zero, giving the task
low priority. Other tasks may be scheduled in its place,
generate load on the hosts in the cluster, and eventually
force T0 to be scheduled on some other cluster, thereby
requiring an additional file transfer. This can have a
dramatic impact on the overall application makespan
as it leads to poor file re-use among tasks, especially in
wide-area bandwidth-constrained environments.

We solved this problem in XSufferage by using a
modified sufferage value definition. For each task and
for each cluster we compute the task’s MCT only for
hosts in the given cluster and call that value the
cluster-level MCT. The cluster-level sufferage value is
computed as the difference between the best and
second-best cluster-level MCT. The task with the highest
cluster-level sufferage is given priority and is
scheduled to the host that achieves the earliest MCT within
the cluster that achieves the earliest cluster-level MCT.
Appendix 8 gives formal descriptions of Max-min, Min-
min, Sufferage, and XSufferage.
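
The difference between the two sufferage definitions can be made concrete with a small Python sketch of the T0 scenario discussed above. The completion times, host names and cluster names are made-up numbers chosen for illustration, and the function names are hypothetical.

```python
def sufferage_value(ct):
    """Host-level sufferage: second-best MCT minus best MCT.

    ct maps host -> estimated completion time (at least two hosts).
    """
    best, second = sorted(ct.values())[:2]
    return second - best


def xsufferage_value(ct, cluster_of):
    """Cluster-level sufferage and the host the task would be given.

    The cluster-level MCT of a cluster is the minimum completion time
    over the hosts of that cluster.
    """
    cluster_mct = {}
    for host, t in ct.items():
        c = cluster_of[host]
        if c not in cluster_mct or t < cluster_mct[c][0]:
            cluster_mct[c] = (t, host)
    ranked = sorted(cluster_mct.values())
    best_t, best_host = ranked[0]
    second_t = ranked[1][0] if len(ranked) > 1 else best_t
    return second_t - best_t, best_host


cluster_of = {"a1": "A", "a2": "A", "b1": "B"}
# T0's large input file is already on cluster A (illustrative numbers).
ct_t0 = {"a1": 200, "a2": 205, "b1": 900}
# T1 has no file locality to exploit.
ct_t1 = {"a1": 300, "a2": 310, "b1": 350}

print(sufferage_value(ct_t0), sufferage_value(ct_t1))     # 5 10
print(xsufferage_value(ct_t0, cluster_of)[0],
      xsufferage_value(ct_t1, cluster_of)[0])             # 700 50
```

With host-level sufferage, T1 is ranked ahead of T0 even though T0 is the task that would pay for an extra wide-area transfer; with cluster-level sufferage the order is reversed.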

4.2.  Simulating  Parameter  Sweeps 
in  Grid  Environments 

In  order  to  evaluate  the  efficacy  of  the  heuristics  de¬ 
scribed  earlier  we  developed  a  Grid  parameter  sweep 
simulator.  At  present,  little  software  is  available  for 
Grid  simulation.  Among  the  most  promising  work, 
the Bricks project [30] addresses the question of
simulating heterogeneous distributed environments for the
purpose of evaluating scheduling strategies, but no
public implementation is available at the moment.
Furthermore, Bricks targets “global computing
systems” [5, 27, 11, 10] rather than application
schedulers. It assumes constant task and data arrival rates
to servers and uses queuing theory in an attempt to
model many users who asynchronously interact with a
global computing system. By contrast, our simulator
is purely event-driven, which is more appropriate in our
framework where the scheduler knows all tasks a-priori
and is in charge of only one application. The
Micro-Grids [15] project will also be of interest for gaining
insight  on  how  our  simulation  results  hold  under  more 
realistic  assumptions.  At  present,  it  cannot  be  used 
to  perform  large  numbers  of  runs  of  large-scale  appli¬ 
cations  as  it  emulates  the  Grid  rather  than  simulates 
runs  of  the  application.  However,  Micro-Grids  uses  a 
network  simulator  that  could  help  us  model  network 
traffic  more  accurately  by  taking  into  account  physi¬ 
cal  network  topology  and  link  contention  due  to  that 
topology. 

Our  simulator  allows  us  to  compare  heuristics  un¬ 
der  the  same  load  conditions,  in  a  reproducible  man¬ 
ner,  for  a  wide  variety  of  system  states  and  application 
scenarios. In addition, we verified the accuracy of our
simulated results by comparing experimental runs in
shared, production environments with simulated
application execution times under similar load. Our
simulator takes as input schedule(), a task/host
assignment heuristic, a description of the application tasks
and  input/output  files,  and  a  description  of  the  Grid 
topology  with  performance  characteristics  of  Grid  re¬ 
sources.  These  characteristics  can  be  constant  values, 
samples  from  random  distributions,  or  traces  from  the 
NWS  [31].  In  this  work  we  use  only  NWS  traces  as 
they  lead  to  more  realistic  models.  The  simulator  also 
allows  for  adding  and  removing  resources  dynamically, 
but  we  do  not  perform  any  experiments  with  transient 
resources  in  this  paper.  The  output  of  the  simulator  is 
a  makespan  value  based  on  the  set  of  input  parameters. 
More  details  on  the  simulator  can  be  found  in  [6]. 

4.3.  Simulation  Results 

4.3.1.  Random  Grids  and  Applications 

In  order  to  perform  a  fair  comparison  of  task/host  se¬ 
lection  heuristics  we  generated  1000  simulated  Grids 
and  2000  simulated  applications.  We  then  randomly 
picked  Grid/application  pairs  among  the  2,000,000  pos¬ 
sible,  and  ran  our  simulator  for  each  pair  with  all 
heuristics.  The  simulations  in  this  section  assume  100% 
accurate  performance  estimation  and  scheduling  events 
occur every 500 seconds. The expectation is that
computing statistical characteristics of makespans achieved
by each heuristic is representative if the sample size is
large enough, that is if enough Grid/application pairs
are simulated. Before presenting the results, let us
describe how Grids and applications were generated.

In what follows we denote by U(x,y) the discrete
integer uniform probability distribution on the interval
[x,y] where x and y are integers. Each Grid contains
a U(2,12) number of clusters and each of those
clusters contains a U(2,32) number of hosts. The
performance of each host is modeled by a CPU load trace
randomly picked among 50 different actual traces
obtained from the NWS for various hosts. Each trace is
then shifted by a random offset, so that two hosts using
the same CPU trace do not exhibit the same behavior
at the same time. Similarly, network link performance
is modeled by randomly picking latency/bandwidth
traces among 20 different NWS traces. All the NWS
traces  that  we  use  for  simulations  in  this  paper  typi¬ 
cally  span  4  days  of  real  time  and  were  obtained  for 
hosts  in  various  US  research  institutions  and  for  net¬ 
work  links  between  these  institutions  (commercial  In¬ 
ternet  or  vBNS).  Traces  were  collected  during  the  first 
week  of  November  1999. 
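
The Grid generation procedure can be sketched as follows. This is an illustrative sketch rather than the simulator's code: traces are represented only by an index, and the distribution of the random offset is not stated in the text, so a uniform offset over a 4-day trace span is assumed here. The names U and random_grid are hypothetical.

```python
import random


def U(x, y, rng):
    """Discrete uniform integer on [x, y], both ends included."""
    return rng.randint(x, y)


def random_grid(rng, n_cpu_traces=50, n_net_traces=20,
                trace_span=4 * 24 * 3600):
    """Return a list of clusters, each with one network link and hosts."""
    clusters = []
    for _ in range(U(2, 12, rng)):
        hosts = [{"cpu_trace": rng.randrange(n_cpu_traces),
                  # Assumed: offset drawn uniformly over the trace span.
                  "offset": rng.uniform(0, trace_span)}
                 for _ in range(U(2, 32, rng))]
        clusters.append({"link_trace": rng.randrange(n_net_traces),
                         "hosts": hosts})
    return clusters


grid = random_grid(random.Random(0))
print(len(grid), [len(c["hosts"]) for c in grid])
```

Passing an explicit random.Random instance keeps the generation reproducible, which matters when the same Grids must be replayed for every heuristic.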

In  accordance  with  typical  MCell  scenarios  we  gen¬ 
erate  applications  as  sets  of  independent  Monte-Carlo 
simulations,  with  the  tasks  of  a  simulation  sharing  a 
(potentially  large)  input  data  file  for  describing  3-D 
geometries.  All  tasks  take  as  input  one  additional  file 
of  1  KByte  and  generate  an  output  file  of  10  KBytes. 
An application is composed of a U(2,10) number of
Monte-Carlo simulations, with each simulation
composed of a U(20,1000) number of tasks. Each task
requires a U(100,300) number of seconds of
computation on an unloaded base CPU. Finally, the size
in KBytes of the geometry file associated with each
Monte-Carlo simulation is U(400,100000), meaning
that those files can reach the size of approximately
95 MBytes. These distributions are representative of
what  can  be  expected  from  real  MCell  applications.  We 
generated  one-thousand  applications  following  exactly 
this  method,  and  we  also  generated  another  thousand 
by  adding  random  file/task  dependencies.  The  idea  is 
to  create  some  perturbation  of  the  regular  structure  of 
the  file/task  dependency  graph  and  investigate  if  such 
a  perturbation  has  an  impact  on  the  relative  perfor¬ 
mances  of  the  different  task/host  selection  heuristics. 
The perturbation consisted of adding a number of
random additional dependencies on the order of one fifth of
the total number of tasks. Even though such
perturbations take us away from typical MCell applications, it
can be interesting to see how they affect the heuristics.
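
A sketch of the application generator is given below, using the distributions listed above. The data layout and the function name random_application are hypothetical, and the exact way in which the extra dependencies were drawn is not specified in the text: here each extra dependency simply links a randomly chosen task to a randomly chosen geometry file, which is an assumption.

```python
import random


def random_application(rng, perturb=False):
    """Return (tasks, files) for one synthetic MCell-like application.

    files maps a shared geometry file name to its size in KBytes;
    each task records its CPU cost and the shared files it reads.
    """
    tasks, files = [], {}
    for s in range(rng.randint(2, 10)):            # Monte-Carlo simulations
        shared = f"geometry_{s}"
        files[shared] = rng.randint(400, 100000)   # KBytes
        for _ in range(rng.randint(20, 1000)):     # tasks per simulation
            tasks.append({"cpu_seconds": rng.randint(100, 300),
                          "inputs": {shared},
                          "private_input_kb": 1,
                          "output_kb": 10})
    if perturb:
        names = list(files)
        # Assumed scheme: about one extra dependency per five tasks.
        for _ in range(len(tasks) // 5):
            rng.choice(tasks)["inputs"].add(rng.choice(names))
    return tasks, files


tasks, files = random_application(random.Random(1), perturb=True)
print(len(files), len(tasks))
```

Some of the extra dependencies may coincide with an existing one, so the number of effective additions is only on the order of one fifth of the tasks.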

The results are summarized in Table 1 for both
standard applications and applications with file/task
dependency perturbation. For each scheduling heuristic
(including a self-scheduled workqueue [12]) and for
each type of application, the table contains three
performance values. The results are computed over 1000
random Grid/application pairs. We use the geometric
mean of the makespans rather than the arithmetic
mean to account for the fact that, depending on the
Grid and the application, makespans in the sample
space can be of different orders of magnitude (one large
makespan could easily dominate the arithmetic mean).

Table 1. Results for Random Grid/Application pairs

            No Perturbation                       Perturbation
Scheduling  Geometric   Av. Degradation  Average  Geometric   Av. Degradation  Average
            Mean (sec)  from Best (%)    Rank     Mean (sec)  from Best (%)    Rank
Max-min     2390        17.3             3.1      2549        18.9             3.1
Min-min     2452        21.2             3.0      2619        23.2             3.0
Sufferage   2329        14.1             2.8      2505        16.7             2.9
XSufferage  2174        6.2              1.8      2316        7.9              1.8
Workqueue   2850        42.2             4.3      3091        48.1             4.2

The second performance value is the “average degradation
from the best”, which is a measure of how far a
heuristic  is  from  the  best  heuristic  on  average.  For  each 
heuristic,  it  is  computed  as  the  arithmetic  mean  over  all 
Grid/application  pairs  of  the  relative  difference  of  the 
makespans  for  that  heuristic  and  for  the  best  heuristic. 
The  smaller  that  value,  the  closer  the  heuristic  to  being 
the  best  one  on  average. 

The third performance value is the “average rank” of
a  heuristic  over  all  Grid/application  pairs.  The  average 
rank  is  computed  as  the  arithmetic  mean  of  the  rank  (1 
to  5),  where  the  heuristic  leading  to  the  best  makespan 
is  of  rank  1  and  the  one  leading  to  the  worst  makespan 
is  of  rank  5. 
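
The three performance values can be computed as in the following Python sketch. Two details are assumptions made for the example, since the text does not spell them out: the relative difference is normalized by the best makespan of the pair, and heuristics with equal makespans share the better rank. The function name summarize is hypothetical.

```python
import math


def summarize(runs):
    """runs: list of dicts (heuristic -> makespan), one per pair.

    Returns heuristic -> (geometric mean, average degradation from
    the best in percent, average rank).
    """
    n = len(runs)
    out = {}
    for h in runs[0]:
        geo = math.exp(sum(math.log(run[h]) for run in runs) / n)
        # Assumed normalization: relative to the best makespan.
        degradation = 100 * sum(
            (run[h] - min(run.values())) / min(run.values())
            for run in runs) / n
        # Rank 1 is the best makespan; ties share the better rank.
        rank = sum(1 + sum(1 for m in run.values() if m < run[h])
                   for run in runs) / n
        out[h] = (geo, degradation, rank)
    return out


runs = [{"A": 100.0, "B": 200.0},
        {"A": 400.0, "B": 100.0}]
for name, values in summarize(runs).items():
    print(name, values)
```

In this two-pair example both heuristics have an average rank of 1.5, yet heuristic A is on average 150% away from the best and heuristic B only 50%, which shows why the three values are reported together.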

These  three  different  performance  values  all  have 
slightly  different  interpretations  and  make  it  possible 
to  gain  a  clear  understanding  of  how  the  heuristics 
compare  with  one  another.  For  instance,  it  could  be 
that  heuristic  1  is  the  best  one  in  most  cases,  and  that 
heuristic  2  is  the  second-best  one  in  most  cases  as  well. 
In  that  case,  their  average  ranks  will  be  close  to  1  and  2 
respectively. This might lead us to think that heuristic
1 is preferable. However, it is possible that the
average degradations from the best are respectively 20% and
5%. For instance, it can be that when heuristic 1 is not
the  best  one  it  is  far  worse  than  the  best  one,  whereas 
heuristic  2  might  not  be  best  often,  but  is  never  far 
behind  the  best  one  in  practice.  In  that  case,  we  would 
probably  conclude  that  heuristic  2  is  preferable. 

The main message from Table 1 is that XSufferage
is the best heuristic, as it leads to the best
geometric mean, average degradation from the best, and
average rank for both perturbed and non-perturbed
applications. Its average degradation from the best is
less than half that of any other heuristic, both for
standard applications and for applications with
perturbation. Its average rank is better than any other by 1
unit. Note that the workqueue, in these experiments,
is the least efficient scheduling algorithm as its average
rank is larger than 4 units. Max-min and Sufferage are
comparable with a slight advantage to Max-min, and
Min-min seems less efficient.

Note that all the results in Table 1 are averages
over a large number of experiments. In the following
sections we will see cases where Min-min leads to good
makespans when compared to Max-min and Sufferage.
We claim that the experimental results presented in
this section are sufficient to show that, of the four
heuristics, XSufferage is the one that leads to the best
schedules for parameter sweep applications, given the models
described in Section 2.

4.3.2.  Varying  Shared  File  Sizes 

Figure 5 shows simulation results for the following
application and Grid. The application consists of 1600
tasks, where each task takes as input a 10K unshared
file and one of eight identical shared files, each shared
by 200 tasks. All tasks are identical in terms of
computational cost (200 seconds on an unloaded base CPU)
and produce a 10K output file. This application
setting is comparable to what some of our target
parameter sweep applications require. We simulate a Grid
such as one that could realistically be used by a user
based at UCSD. That Grid contains 5 clusters
containing respectively 6, 6, 8, 20, and 20 hosts. The
performance characteristics of the hosts are based on actual
CPU traces obtained via the Network Weather Service.
The network links are also modeled from NWS traces
obtained over the course of a day between a
workstation at UCSD and several remote sites accessible by
commercial Internet links or the vBNS. Bandwidths
on these links vary from as little as 6 KBytes/sec
to 600 KBytes/sec depending on the link and on the
time of the day. In terms of bandwidth averages, one
can classify two of the links as fast (500 KBytes/sec),
three of the links as moderate (between 100 and 200
KBytes/sec), and one as slow (50 KBytes/sec). One
of the two large 20-host clusters is accessible via a fast
link. For large shared file sizes on the graphs, the
average ratio between file transfer time and computation
time for one task is about 3 on a fast link and 30 on
a slow link. For the smallest shared file sizes in our
study, that ratio is about 0.2 on a fast link and 2 on
a slow link. For these experiments, the interval
between scheduling events was always 500 seconds, and
we assumed 100% accurate performance estimations.

Figure 5. Makespan vs. shared file size for
different heuristics

The graph in Figure 5 plots makespans vs.
shared file sizes between 10 and 150 MBytes for
the four heuristics and the self-scheduled workqueue.
For very small shared file sizes (up to 100 KBytes, not
plotted on the graph), the workqueue leads to better
makespans than the other heuristics when files are so small
that the effect of file sharing becomes negligible.
However, the workqueue quickly becomes inefficient when
the shared file size increases. The other heuristics perform
similarly for small shared file sizes, but one can see on
the graph that XSufferage performs at least about 20%
better than Min-min and 40% better than Max-min
and Sufferage for a file size of 150 MBytes. We obtained
similar results with different Grid configurations. We
also performed experiments with much larger file sizes.
Even though those experiments are not very realistic
given current networking capabilities, they provide
information about what happens when file re-use is the
only constraint for achieving good scheduling. The
results showed that XSufferage consistently outperforms
all other heuristics by at least 50%. These results show
that XSufferage does a better job at capturing and
taking advantage of file sharing patterns to maximize file
re-use.

5. Adaptive Scheduling

5.1. Quality of Information

A new avenue of research that we are beginning to
explore is the study of Quality of Information (QoI) on
scheduling, that is the impact of the performance
estimation accuracy and other qualitative attributes on
different scheduling strategies. We expect different
heuristics to react differently to degrading levels of
accuracy and that strategies that do not depend on
performance estimation and forecast (e.g. self-scheduled
ones [16]) will be more performance-efficient when QoI
is low. Low QoI can also be accounted for in
adaptive scheduling algorithms such as schedule(). The
following section presents our first simulation results
for different levels of QoI and for increasingly adaptive
versions of schedule().

Our initial model for simulating different levels of QoI is simple. Our simulator allows us to obtain 100% accurate estimates for all file transfer and computational times, and we add random noise to those estimates to simulate inaccurate performance estimates. For each estimate used by the scheduling algorithm we introduce a percentage error that is uniformly distributed on the interval [-p, +p], where p is a value between 0% and 100%. Perfectly accurate QoI corresponds to p = 0, whereas p = 10 means that every 100% accurate estimate will be randomly increased or decreased by up to 10%.
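The noise model described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `perturb` and its parameters are hypothetical and do not come from the simulator used in this paper, and the error is applied multiplicatively to the accurate estimate.

```python
import random


def perturb(estimate, p, rng=random):
    """Return `estimate` with a percentage error drawn uniformly from [-p, +p].

    `estimate` is the 100% accurate value (e.g. a transfer time in seconds)
    and `p` is the maximum percentage error, between 0 and 100.
    Hypothetical helper: not part of the paper's simulator.
    """
    if not 0 <= p <= 100:
        raise ValueError("p must lie between 0 and 100")
    error_percent = rng.uniform(-p, p)
    return estimate * (1.0 + error_percent / 100.0)
```

With p = 0 the estimate is returned unchanged; with p = 10 an 80-second estimate is mapped to a value between 72 and 88 seconds.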

This model is sufficient for obtaining initial results concerning the impact of QoI on the scheduling of parameter sweep applications; however, it makes two assumptions that are not realistic for real forecasting services that will be deployed in Computational Grids. First, it assumes that QoI behavior is the same for all estimates (for file transfer times and computational times) and for all resources. This is clearly not the case, as network behavior is significantly different from CPU behavior for performance prediction purposes [32], and some resources will generally be more predictable than others on a regular basis. Second, it assumes that QoI behavior does not depend on whether a forecast is needed for an event in the short term or in the long term. For instance, our model uses the same error model for predicting a file transfer time whether the transfer is initiated in the next minute or in an hour. A more realistic model should probably try to capture some decay of the QoI as predictions become more and more long-term. Note that this issue becomes less critical for high scheduling event frequencies. Future work will aim at providing a more realistic model of QoI based on experiments with deployed Grid services, such as the Network Weather Service [31], and with a variety of Grid resources.

Footnote: The term “Quality of Information” is used to describe qualitative aspects of performance predictions in the Performance Prediction Engineering Project [14].

5.2.  Simulation  Results 

Figure 6 shows simulation results for four different scheduling event frequencies and decreasing QoI levels.

We use the same simulated Grid as the one used in Section 4.3.2, and the application is modeled after an MCell computation that performs eight moderate-size Monte-Carlo simulations (100 tasks each) for eight different geometry configurations of a neuro-muscular junction. Geometry files are on the order of 40 MBytes, meaning that network transfers of those files take on average 80 seconds on a fast link and about 800 seconds on a slow link. The average task computational time over all hosts is about 110 seconds; it can be as fast as 90 seconds and as slow as 350 seconds depending on the host and on its load when the task is running (as simulated by an offset in an NWS CPU load trace).
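As a rough worked check on these figures (assuming a constant transfer rate and ignoring latency, which is a simplification not stated in the text), the quoted transfer times for a 40 MByte geometry file correspond to the following effective throughputs:

```latex
\frac{40\ \text{MBytes}}{80\ \text{s}} = 0.5\ \text{MBytes/s}
\qquad\text{and}\qquad
\frac{40\ \text{MBytes}}{800\ \text{s}} = 0.05\ \text{MBytes/s}
```

On a slow link, moving one geometry file therefore costs roughly seven times the average task computational time (800 seconds vs. about 110 seconds), which gives a sense of why file re-use matters for this application.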

All data points in the graphs of Figure 6 are computed as the average makespan over 50 simulated runs. This is necessary since we introduce random noise into performance estimates in order to simulate different levels of QoI. All graphs plot average makespans vs. values of p (defined in Section 5.1), for the heuristics presented in Section 4.1 as well as for a self-scheduled workqueue algorithm. Since the workqueue does not make use of performance estimates, it is not sensitive to QoI. It is shown as a horizontal solid line on the four graphs (with a makespan of 1730 seconds). The variances associated with the 50 samples for each data point were small for all heuristics: coefficients of variation were on the order of 5%.
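For readers who want to reproduce this kind of summary, the following Python sketch computes the average makespan and the coefficient of variation (standard deviation divided by mean) for one data point. The function name `summarize` is hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics


def summarize(makespans):
    """Return (mean, coefficient of variation) of a list of makespans.

    Assumption: the population standard deviation is used; the paper
    does not specify the estimator.
    """
    mean = statistics.mean(makespans)
    cv = statistics.pstdev(makespans) / mean
    return mean, cv
```

For example, two runs with makespans of 90 and 110 seconds have a mean of 100 seconds and a coefficient of variation of 0.1, i.e. 10%.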

Graph 6(a) plots results when there is only one initial scheduling event, meaning that the scheduling algorithm is not adaptive. All heuristics but Max-min lead to better makespans than the workqueue for perfect QoI (p = 0), but their performance degrades very rapidly when p increases. Max-min leads to the worst makespans, but overall, all heuristics lead to makespans at least 40% larger than the workqueue when p is greater than 50. This result is not surprising, as the cumulative errors of performance estimates impact the computations of the various MCTs required by the heuristics.

Graph 6(b) shows the results when there is a scheduling event every 500 seconds, or 3 times during each run of the application in this case. One can notice that some heuristics outperform the workqueue for values of p up to 20. Max-min and Sufferage perform worse as soon as the QoI is not perfect.

Graph 6(c) shows the results when there is a scheduling event every 250 seconds, or between 5 and 8 times for each run depending on the heuristic being used. The effect observed on Graph 6(b) is more pronounced in that heuristics become more tolerant to low QoI thanks to increased adaptivity, even though Max-min still leads to large makespans. Sufferage outperforms the workqueue for values of p lower than 30, whereas Min-min and XSufferage lead to better makespans than the workqueue for all values of p. For perfect accuracy, XSufferage outperforms the workqueue by as much as 25%.

Finally, Graph 6(d) shows results for scheduling events every 125 seconds, or between 11 and 14 times per run. Sufferage now outperforms the workqueue for p up to 80, whereas Min-min and XSufferage keep benefiting from increased adaptivity. The results show little improvement for higher scheduling frequencies. This is due to the granularity of the application: since tasks take at least 90 seconds, little can be gained by calling schedule() more than once every 125 seconds. For “good” QoI (p < 5%), XSufferage always outperforms Min-min.

Note that these results are preliminary and that it is difficult to use them to rank the different heuristics according to their respective robustness to inaccurate performance predictions. It will be necessary to perform experiments for large numbers of different Grid configurations and application structures, as was done in Section 4.3.1. A future paper will contain results from such experiments as well as a more in-depth study of QoI issues.

Note also that in these experiments we assume that the QoI does not depend on the scheduling event frequency (see the discussion in Section 5.1). However, a high scheduling frequency implies that the heuristics do not use long-term predictions (see the discussion on step (5) in Section 3.2). Assuming that short-term predictions are typically more accurate than long-term predictions, higher scheduling frequencies lead to improved QoI. On Graph 6(d) we show simulation results for values of p up to 100, but we expect that in reasonably stable Grid environments with appropriate forecasting services, the performance estimation error will not be as large as +/- 100% for short-term predictions. The left part of the graph should be more representative of what we can expect for real systems.

[Figure 6. Simulation results for various levels of QoI and adaptivity. Panels: (a) one initial scheduling event; (b) scheduling events every 500 sec; (c) scheduling events every 250 sec; (d) scheduling events every 125 sec.]

Finally, the scheduling event frequency impacts the performance of the adaptive algorithm even for perfect QoI. For instance, XSufferage has an average makespan around 1550 seconds for p = 0 and scheduling events every 500 seconds, whereas that average makespan becomes lower than 1400 seconds for higher frequencies. All heuristics but Max-min achieve better makespans for higher frequencies in the case of perfect QoI. This is rather counter-intuitive, as one would expect that perfect QoI would not mandate any adaptivity at all. The fact is that applying those heuristics to small sets of tasks leads to better scheduling decisions and shorter makespans. As tasks complete, successive calls to schedule() apply the heuristics to decreasing sets of tasks, leading to better overall makespans. This suggests that calling the scheduling algorithm repeatedly is a good idea for Sufferage, Min-min, and XSufferage, even if the environment is very predictable.

6.  Summary  of  Simulation  Results 

The simulation results in Section 4.3.1 showed that XSufferage is on average a better heuristic than the traditional Max-min, Min-min, Sufferage, and self-scheduled workqueue for scheduling parameter sweep applications such as MCell on Computational Grids, provided the models of Section 2 hold. We believe that XSufferage leads to better results because it better captures the file/task dependencies of the application and leads to improved file re-use. This claim is supported by the results in Section 4.3.2. Indeed, all experiments we have conducted with very large shared files seem to indicate that XSufferage leads to better makespans than its contenders. Finally, the preliminary study of Quality of Information and adaptivity in Section 5 showed that all heuristics can benefit from increased adaptivity and from increased QoI. The results seem to indicate that XSufferage leads to very good results for good QoI and compares well with other heuristics for poor QoI.

It is always difficult to make general statements about the relative efficiency of scheduling algorithms, since the space of possible Grid configurations and application structures is very large. The solution is to sample both the Grid and the application space as much as possible, as was done in Section 4.3.1. Future work will contain such sampling for experiments similar to the ones presented in Sections 4.3.2 and 5 in order to make the results concerning shared file sizes and QoI more general. Ironically, performing such large-scale simulations in the Grid/application space is itself a parameter sweep application, and we will probably use the PST software to distribute it on a real computational Grid.

7.  Related  Work 

A large number of research papers address the question of mapping sets of tasks onto sets of processors with a view to minimizing overall execution time. Many of these papers address the case where tasks are independent [17, 12, 16, 20]. Scheduling heuristics found in [20] were adapted to our framework as discussed in Section 4. All these papers make simplifying assumptions for task execution times (constant, following a truncated Gaussian distribution, etc.) and none of them takes into account data storage issues.

The work described in [2] focuses on scheduling applications structured as DAGs on heterogeneous sets of processors and uses heuristics that are related to Max-min and Min-min for level-by-level scheduling of the graphs. However, special attention is paid to data storage issues, which makes that work related to the research presented in this paper. A major difference between our work and the development in [2] is that the latter assumes constant, perfectly predictable performance characteristics for resources, as should be available in advanced reservation QoS environments. Also, the different application structures (Parameter Sweep vs. DAGs) lead to many differences between the models in this paper and in [2]. For instance, data repositories are located anywhere on the network (as opposed to within a cluster) and datasets are pre-staged to these repositories. Finally, [4] contains a survey that encompasses several heuristics in addition to the ones described in [20]. We will consider these heuristics in our future work. This work contrasts with others in that we: (i) take into account application data storage; (ii) model shared, heterogeneous computational and network resources with realistic dynamic performance characteristics; (iii) study the impact of the accuracy of performance prediction; (iv) introduce a new heuristic for scheduling parameter sweep applications (XSufferage).

This work is also related to our work on an AppLeS Parameter Sweep Template (PST) in that the results in this paper provide a good justification that XSufferage should be implemented as part of the PST scheduler. PST will provide a practical way to deploy and schedule parameter sweep applications on the computational Grid using available software infrastructures and will be described in a future paper. PST itself is related to the Nimrod project [1]. Nimrod targets parameter sweep applications, but its scheduling approach is different from ours as it is based on deadlines and on a Grid economy model. Also, to the best of our knowledge, Nimrod does not take into account dynamic Grid conditions or file locality constraints for scheduling. In fact, the work in this paper and our work on the PST software should be applicable to Nimrod, and one can envision an implementation of PST as a Nimrod module.

8.  Conclusion  and  Future  Work 

In this paper we have proposed an adaptive scheduling algorithm for parameter sweep applications in Grid environments. In particular, we address the case of applications where tasks can share input files (e.g. MCell [29]) and the case of non-dedicated Computational Grids that span non-dedicated wide-area networks. After precisely defining our application and Grid model, we adapted three standard heuristics for performing task/host assignment (Max-min, Min-min, Sufferage) and proposed an extension of the Sufferage heuristic, XSufferage. We also introduced the notion of Quality of Information (QoI) to account for inaccuracies in performance predictions. We used simulation to compare the four heuristics and a self-scheduled workqueue algorithm in multiple settings with various shared file sizes, levels of QoI, application structures, Grid topologies and resources. The simulation results demonstrated that: (i) XSufferage leads to better schedules on average by quite a large margin; (ii) increased adaptivity benefits all four heuristics even for perfect QoI; (iii) XSufferage leads to better schedules for larger shared file sizes; (iv) XSufferage is as tolerant as the other heuristics to poor QoI and more efficient for good QoI.

Future work will provide improvements to our models such as more realistic network and storage models (encompassing shared-storage and link contention, and limited storage space), and alternate application usage scenarios. We will also study the concept of QoI further by investigating realistic QoI models and performing more QoI-related experiments with our simulator. New heuristics such as the ones found in [4, 21] will be considered for implementing step (5) of our algorithm. As discussed in Section 3.2, all steps of the algorithm can lead to new research in different directions (performance prediction and forecasting, task-space reduction, trade-offs between schedule disruption and maximum resource utilization). Also, the algorithm can be adapted to provide ways to perform adaptive scheduling for other classes of applications by using different heuristics in step (5). Finally, we will incorporate the results here, as well as many of these future improvements, into a practical programming environment and adaptive scheduler for parameter sweep applications on the Grid, an AppLeS Parameter Sweep Template. We believe that such software will provide a useful first step in achieving performance and programmability for applications in Grid environments.
Appendix A: Task/host Selection Heuristics

Let $H_{j,k}$ denote the $k$-th host within the $j$-th cluster and $C(T_i, H_{j,k})$ the estimated completion time of task $T_i$ on host $H_{j,k}$. Let us define the argmin operator:

Definition: Given a function $f$ from $\mathbb{R}^n$ into $\mathbb{R}$,
$f(\operatorname{argmin}_{x \in \mathbb{R}^n} f(x)) = \min_{x \in \mathbb{R}^n} f(x)$.

The operator denotes one of the possible vectors that achieve the minimum of the function $f$. The way ties are broken is left to the implementation, and in this work they are broken randomly. An argmax operator can be defined in a similar fashion. Assuming a task set $T$, we can now describe each heuristic as follows:

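Min-min

while ($T \neq \emptyset$)
    foreach ($T_i \in T$)
        $(c_i^{(1)}, h_i^{(1)}) = \operatorname{argmin}_{j,k}\, C(T_i, H_{j,k})$
    end foreach
    $s = \operatorname{argmin}_i\, C(T_i, H_{c_i^{(1)}, h_i^{(1)}})$
    assign $T_s$ to $H_{c_s^{(1)}, h_s^{(1)}}$
    $T = T - \{T_s\}$
end while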

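Max-min

while ($T \neq \emptyset$)
    foreach ($T_i \in T$)
        $(c_i^{(1)}, h_i^{(1)}) = \operatorname{argmin}_{j,k}\, C(T_i, H_{j,k})$
    end foreach
    $s = \operatorname{argmax}_i\, C(T_i, H_{c_i^{(1)}, h_i^{(1)}})$
    assign $T_s$ to $H_{c_s^{(1)}, h_s^{(1)}}$
    $T = T - \{T_s\}$
end while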
Sufferage

while ($T \neq \emptyset$)
    foreach ($T_i \in T$)
        $(c_i^{(1)}, h_i^{(1)}) = \operatorname{argmin}_{j,k}\, C(T_i, H_{j,k})$
        $(c_i^{(2)}, h_i^{(2)}) = \operatorname{argmin}_{(j,k) \neq (c_i^{(1)}, h_i^{(1)})}\, C(T_i, H_{j,k})$
        $suf_i = C(T_i, H_{c_i^{(2)}, h_i^{(2)}}) - C(T_i, H_{c_i^{(1)}, h_i^{(1)}})$
    end foreach
    $s = \operatorname{argmax}_i\, suf_i$
    assign $T_s$ to $H_{c_s^{(1)}, h_s^{(1)}}$
    $T = T - \{T_s\}$
end while



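XSufferage

while ($T \neq \emptyset$)
    foreach ($T_i \in T$)
        foreach cluster $j$
            $h_{i,j} = \operatorname{argmin}_k\, C(T_i, H_{j,k})$
        end foreach
        $c_i^{(1)} = \operatorname{argmin}_j\, C(T_i, H_{j,h_{i,j}})$
        $c_i^{(2)} = \operatorname{argmin}_{j \neq c_i^{(1)}}\, C(T_i, H_{j,h_{i,j}})$
        $suf_i = C(T_i, H_{c_i^{(2)}, h_{i,c_i^{(2)}}}) - C(T_i, H_{c_i^{(1)}, h_{i,c_i^{(1)}}})$
    end foreach
    $s = \operatorname{argmax}_i\, suf_i$
    assign $T_s$ to $H_{c_s^{(1)}, h_{s,c_s^{(1)}}}$
    $T = T - \{T_s\}$
end while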
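
The four heuristics above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the simulator used in this paper: all function names are hypothetical, a host is identified by a (cluster, host) pair, the completion time is modeled simply as the host's ready time plus a per-host execution time (file transfers are ignored), and ties are broken deterministically in favor of the first candidate rather than randomly.

```python
def best_host(task, hosts, C):
    """Host with minimum completion time for `task` (first one wins ties)."""
    return min(hosts, key=lambda h: C(task, h))


def min_min(tasks, hosts, C):
    # Pick the task whose minimum completion time is smallest.
    s = min(tasks, key=lambda t: C(t, best_host(t, hosts, C)))
    return s, best_host(s, hosts, C)


def max_min(tasks, hosts, C):
    # Pick the task whose minimum completion time is largest.
    s = max(tasks, key=lambda t: C(t, best_host(t, hosts, C)))
    return s, best_host(s, hosts, C)


def _gap(times):
    """Second-best minus best value; 0 when there is no second candidate."""
    ordered = sorted(times)
    return ordered[1] - ordered[0] if len(ordered) > 1 else 0.0


def sufferage(tasks, hosts, C):
    # Host-level sufferage: second-best host vs. best host.
    s = max(tasks, key=lambda t: _gap(C(t, h) for h in hosts))
    return s, best_host(s, hosts, C)


def xsufferage(tasks, hosts, C):
    # Cluster-level sufferage: best host of the second-best cluster
    # vs. best host of the best cluster.
    def suf(t):
        per_cluster = {}
        for h in hosts:
            j = h[0]  # cluster index
            per_cluster[j] = min(per_cluster.get(j, float("inf")), C(t, h))
        return _gap(per_cluster.values())

    s = max(tasks, key=suf)
    return s, best_host(s, hosts, C)


def run(exec_time, hosts, pick):
    """Apply the heuristic `pick` repeatedly until every task is assigned.

    `exec_time[task][host]` is an assumed, fixed execution time.
    Returns (assignment, makespan).
    """
    ready = {h: 0.0 for h in hosts}  # time at which each host becomes free
    remaining = sorted(exec_time)
    assignment = {}

    def C(task, host):  # assumed completion-time model
        return ready[host] + exec_time[task][host]

    while remaining:
        task, host = pick(remaining, hosts, C)
        ready[host] = C(task, host)
        assignment[task] = host
        remaining.remove(task)
    return assignment, max(ready.values())
```

The difference between Sufferage and XSufferage shows up when two hosts of the same cluster offer nearly identical completion times for a task: the host-level sufferage of that task is then close to zero, whereas its cluster-level sufferage remains large if every other cluster is much slower for it.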


References 

[1] D. Abramson and J. Giddy. Scheduling Large Parametric Modelling Experiments on a Distributed Metacomputer. In PCW’97, Sep. 1997.

[2] A. Alhusaini, V. Prasanna, and C. Raghavendra. A Unified Resource Scheduling Framework for Heterogeneous Computing Environments. In Proceedings of the 8th IEEE Heterogeneous Computing Workshop (HCW’99), pages 156-165, Apr. 1999.

[3] F. Berman. The Grid, Blueprint for a New Computing Infrastructure, chapter 12. Morgan Kaufmann Publishers, Inc., 1998. Edited by Ian Foster and Carl Kesselman.

[4] R. Braun, H. Siegel, N. Beck, L. Boloni, M. Maheswaran, A. Reuther, J. Robertson, M. Theys, B. Yao, D. Hensgen, and R. Freund. A Comparison Study of Static Mapping Heuristics for a Class of Meta-tasks on Heterogeneous Computing Systems. In Proceedings of the 8th Heterogeneous Computing Workshop (HCW’99), pages 15-29, Apr. 1999.

[5] H. Casanova and J. Dongarra. NetSolve: A Network Server for Solving Computational Science Problems. The International Journal of Supercomputer Applications and High Performance Computing, 1997.

[6] H. Casanova, A. Legrand, D. Zagorodnov, and F. Berman. Using Simulation to Evaluate Scheduling Heuristics for a Class of Applications in Grid Environments. Technical Report RR1999-46, Ecole Normale Superieure de Lyon, France, Sep. 1999.

[7] W. Clark. The Gantt Chart. Pitman and Sons, London, 3rd edition, 1952.

[8] I. Foster and C. Kesselman, editors. The Grid, Blueprint for a New Computing Infrastructure. Morgan Kaufmann Publishers, Inc., San Francisco, USA, 1998.

[9] I. Foster, C. Kesselman, J. Tedesco, and S. Tuecke. GASS: A Data Movement and Access Service for Wide Area Computing Systems. In Proceedings of the Sixth Workshop on I/O in Parallel and Distributed Systems, May 1999.

[10] I. Foster and C. Kesselman. Globus: A Metacomputing Infrastructure Toolkit. International Journal of Supercomputer Applications, 11(2):115-128, 1997.

[11] A. Grimshaw, A. Ferrari, F. Knabe, and M. Humphrey. Wide-area computing: Resource sharing on a large scale. IEEE Computer, 32(5):29-37, May 1999.

[12] T. Hagerup. Allocating Independent Tasks to Parallel Processors: An Experimental Study. Journal of Parallel and Distributed Computing, 47:185-197, 1997.

[13] http://apples.ucsd.edu.

[14] http://apples.ucsd.edu/perf.html.

[15] http://www-csag.ucsd.edu/projects/grid/microgrid.html.

[16] S. F. Hummel, J. Schmidt, R. N. Uma, and J. Wein. Load-sharing in heterogeneous systems via weighted factoring. In Proceedings of the 8th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 318-328, June 1996.

[17] O. H. Ibarra and C. E. Kim. Heuristic algorithms for scheduling independent tasks on nonidentical processors. Journal of the ACM, 24(2):280-289, Apr. 1977.

[18] T. Kidd and D. Hensgen. Why the Mean is Inadequate for Accurate Scheduling Decisions. In Proceedings of the 3rd International Symposium on Parallel Architectures, Algorithms, and Networks (ISPAN’99), Jun. 1999.

[19] B. Lowekamp, N. Miller, D. Sutherland, T. Gross, P. Steenkiste, and J. Subhlok. A Resource Query Interface for Network-Aware Applications. In Proceedings of the 7th IEEE Symposium on High-Performance Distributed Computing, July 1998.

[20] M. Maheswaran, S. Ali, H. J. Siegel, D. Hensgen, and R. Freund. Dynamic Matching and Scheduling of a Class of Independent Tasks onto Heterogeneous Computing Systems. In 8th Heterogeneous Computing Workshop (HCW’99), Apr. 1999.

[21] M. Mitzenmacher. How useful is old information. In Proceedings of the 16th ACM Symposium on Principles of Distributed Computing, pages 83-91, 1997.

[22] M. Pinedo. Scheduling: Theory, Algorithms, and Systems. Prentice Hall, Englewood Cliffs, NJ, 1995.


[23] J. Plank, M. Beck, W. Elwasif, T. Moore, M. Swany, and R. Wolski. The Internet Backplane Protocol: Storage in the Network. In Proceedings of NetStore ’99: Network Storage Symposium, Internet2, 1999.

[24] J. Pruyne and M. Livny. A Worldwide Flock of Condors: Load Sharing among Workstation Clusters. Journal on Future Generations of Computer Systems, 12, 1996.

[25] S. Rogers and D. Kwak. Steady and Unsteady Solutions of the Incompressible Navier-Stokes Equations. AIAA Journal, 29(4):603-610, Apr. 1991.

[26] J. Schopf and F. Berman. Stochastic Scheduling. In Proceedings of SuperComputing ’99, Portland, 1999.

[27] S. Sekiguchi, M. Sato, H. Nakada, S. Matsuoka, and U. Nagashima. Ninf: Network based Information Library for Globally High Performance Computing. In Proc. of Parallel Object-Oriented Methods and Applications (POOMA), Santa Fe, pages 39-48, February 1996.

[28] G. Shao, F. Berman, and R. Wolski. Using Effective Network Views to Promote Distributed Application Performance. In Proceedings of the 1999 International Conference on Parallel and Distributed Processing Techniques and Applications, 1999.

[29] J. Stiles, T. Bartol, E. Salpeter, and M. Salpeter. Monte Carlo simulation of neuromuscular transmitter release using MCell, a general simulator of cellular physiological processes. Computational Neuroscience, pages 279-284, 1998.

[30] A. Takefusa, S. Matsuoka, H. Nakada, K. Aida, and U. Nagashima. Overview of a performance evaluation system for global computing scheduling algorithms. In Proceedings of the 8th IEEE International Symposium on High Performance Distributed Computing (HPDC8), pages 97-104, Aug. 1999.

[31] R. Wolski. Dynamically Forecasting Network Performance Using the Network Weather Service. In 6th High-Performance Distributed Computing Conference, pages 316-325, August 1997.

[32] R. Wolski, N. Spring, and J. Hayes. Predicting the CPU Availability of Time-shared Unix Systems on the Computational Grid. In Proceedings of the 8th IEEE International Symposium on High Performance Distributed Computing (HPDC8), Aug. 1999.


Henri Casanova is a Computer Science and Engineering project scientist at the University of California, San Diego. His research interests include all areas of metacomputing, and in particular theoretical models for the efficient scheduling of distributed applications in computational Grid environments. He received his BS from the Ecole Nationale Superieure d’Electrotechnique, d’Electronique, d’Informatique et d’Hydraulique de Toulouse (ENSEEIHT), his MS from the Universite Paul Sabatier, Toulouse, and his PhD from the University of Tennessee, Knoxville.

Arnaud Legrand is a Computer Science graduate student at the Ecole Normale Superieure, the leading French scientific research and teaching institution, Lyon, France. He is interested in parallelism, metacomputing and numerical simulation and is currently working at the Laboratoire de l’Informatique et du Parallelisme (Parallel Computing Laboratory, ENS Lyon) on linear algebra algorithms that are tailored to heterogeneous environments.

Dmitrii  Zagorodnov  is  currently  pursuing  a  PhD  at 
the  Department  of  Computer  Science  and  Engineering 
at  the  University  of  California,  San  Diego.  He  has 
received  B.S.  and  M.S.  degrees  in  computer  science 
from  the  University  of  Alaska  Fairbanks  in  1995  and 
1997,  respectively.  His  current  research  interest  is  in 
fault  tolerance  for  distributed  systems. 

Francine  Berman  is  a  Professor  of  Computer  Science 
and  Engineering  at  U.  C.  San  Diego,  Senior  Fellow  at 
the  San  Diego  Supercomputer  Center,  Fellow  of  the 
ACM,  and  founder  of  the  Parallel  Computation  Lab¬ 
oratory  at  UCSD.  Her  research  interests  over  the  last 
two  decades  have  focused  on  parallel  and  distributed 
computation,  and  in  particular  the  areas  of  program¬ 
ming  environments,  tools,  and  models  that  support 
high-performance  computing.  She  received  her  B.A. 
from  the  University  of  California,  Los  Angeles,  and  her 
M.S.  and  Ph.D.  from  the  University  of  Washington. 



Parallel  Program  Execution  on  a  Heterogeneous 
PC  Cluster  Using  Task  Duplication 

Yu-Kwong  Kwok 

Department  of  Electrical  and  Electronic  Engineering 
The  University  of  Hong  Kong,  Pokfulam  Road,  Hong  Kong 

Email: ykwok@eee.hku.hk


Abstract*: In this paper, we propose to use a duplication based approach in scheduling tasks to a heterogeneous cluster of PCs. In duplication based scheduling, critical tasks are redundantly scheduled to more than one machine in order to reduce the number of inter-task communication operations. The start times of the succeeding tasks are also reduced. The task duplication process is guided by the system heterogeneity in that the critical tasks are scheduled or replicated on faster machines. The algorithm has been implemented in our prototype program parallelization tool for generating MPI code executable on a cluster of Pentium PCs. Our experiments using three numerical applications have indicated that the heterogeneity of a PC cluster, being an inevitable feature, is indeed useful for optimizing the execution of parallel programs.

Keywords:  Scheduling,  task  graphs,  algorithms, 
parallel  processing,  heterogeneous  systems,  PC 
cluster  computing,  task  duplication,  resource 
management. 

1  Introduction 

Recently we have witnessed an increasing interest in employing a network of PCs connected by a high-speed network to tackle many computationally intensive parallel applications [9], [18]. Parallel processing using a cluster of machines, also commonly called cluster computing, enables a much larger community of users than ever before to efficiently tackle many difficult optimization problems on a readily available platform [9], [18]. However, realizing the goal of efficient cluster computing entails handling a number of resource management chores [18]. One of the most important problems is the scheduling of tasks. Indeed, to effectively harness the aggregate computing power of such a heterogeneous cluster, it is crucial to judiciously allocate and sequence the tasks on the machines. In a broad sense, the scheduling problem exists in two forms: dynamic and static. In dynamic scheduling, few assumptions about the parallel program can be made before execution, and thus scheduling decisions have to be made on-the-fly. The goal of a dynamic scheduling algorithm as such includes not only the minimization of the program completion time but also the minimization of scheduling overhead, which represents a significant portion of the cost paid for running the scheduler. In a PC cluster environment, such dynamic scheduling algorithms usually employ the so-called “idle-cycle-stealing” approach [5], which attempts to dynamically balance the work load evenly across all the machines. However, when the objective of scheduling is to minimize the execution time of a parallel application, such dynamic scheduling strategies are not suitable.

On the other hand, the approach of using static
scheduling algorithms [11], [12], [22], which can
afford to spend more time generating an optimized
schedule off-line, is particularly effective for many
scientific applications such as the adaptive
simulation of the N-body problem, object recognition
using iterative image processing algorithms, and
some other numerical applications [1], [3], [4],
[13], [14], [19], [25] because the characteristics of
such applications can be determined at compile
time. A parallel program, therefore, can be
represented by a directed acyclic task graph [3], in
which the node weights represent task processing
times and the edge weights represent data
dependencies as well as the communication times
between tasks [3], [6]. The static scheduling
problem is, in general, NP-complete [5], [8] and
many heuristics have been suggested in the
literature for scheduling a parallel machine.
However, the problem of scheduling tasks to a
cluster is a relatively less explored topic.
Specifically, there are two difficult research issues
to be tackled in the scheduling problem for cluster
computing:

1) Communication overhead: The
communication overhead in a network of
PCs is still very significant relative to the
processing power of the machines [9]. Thus,
to avoid offsetting the gain from
parallelization by excessive communication
overhead, the tasks should be scheduled in
such a manner that the number of
communications is kept small.

2) Heterogeneity: In a PC cluster, which
typically undergoes continual upgrading,
heterogeneity in the hardware configuration
is unavoidable. Heterogeneity can be a
potential problem for some highly regular
applications (e.g., some data parallel
problems). However, it has been
demonstrated that heterogeneity is useful for
further enhancing the performance of
irregularly structured parallel applications
[7], [21], by exploiting the affinity of
different tasks to different machines.

In this study, we propose to use a duplication
approach to scheduling the tasks to the cluster. In
duplication based scheduling, critical tasks are
redundantly scheduled to more than one machine
in  order  to  reduce  the  number  of  inter-task 
communication  operations.  The  start  times  of  the 
succeeding  tasks  are  also  reduced.  There  have  been 
many  duplication  approaches  suggested  in  the 
literature [1], [10], [15], [16], [17], [20]. However,
all  these  methods  are  designed  for  homogeneous 
parallel  architectures.  Furthermore,  the  previous 
approaches  are  all  evaluated  based  on  simulations 
rather  than  using  real  applications  with  a 
parallelizing  compiler.  In  our  proposed  approach, 
the  task  duplication  process  is  guided  by  tracking 
the  critical  path  of  the  task  graph  given  the  system 
heterogeneity  in  that  the  critical  tasks  are  scheduled 
or  replicated  in  faster  machines.  Task  duplication  is 
indeed  particularly  effective  for  heterogeneous 
systems  because  the  overall  completion  time  of  an 
application  is  usually  determined  by  a  subset  of 
tasks (i.e., the critical path, discussed in detail in
Section  2)  which  can  be  scheduled  to  execute 
efficiently  on  the  faster  machines.  We  have 
implemented  this  duplication  based  scheduling 
algorithm  in  the  parallel  code  generator  of  a 
prototype  program  parallelization  tool  [2],  which 
generates  MPI  code  executable  on  a  network  of 
Pentium  PCs.  The  system  on  which  we  tested  our 
approach is shown schematically in Figure 1. Our
experiments  using  several  real  applications  have 
demonstrated  that  the  duplication  technique  is  very 
effective  in  reducing  the  completion  time  of  the 
applications  on  a  heterogeneous  cluster  of  Pentium 
II  PCs  connected  via  a  Fore  Fast  Ethernet  switch. 

The  remainder  of  this  paper  is  organized  as 
follows.  In  the  next  section,  we  describe  in  detail 
the  model  used  and  the  design  considerations  of  the 
duplication  algorithm.  Section  3  includes  the  results 
of  our  performance  study.  The  last  section 
concludes  the  paper. 

Figure 1: System support for high-performance computing on a heterogeneous cluster.


2  Scheduling  for  a  Heterogeneous  PC 
Cluster 

In  this  section,  we  first  describe  our  scheduling 
model,  followed  by  a  discussion  of  the  duplication 
techniques  employed  in  our  scheduling  module  of 
the  parallel  code  generator. 

2.1  The  Model 

A parallel program is composed of $n$ tasks
$\{T_1, T_2, \ldots, T_n\}$ in which there is a partial order:
$T_i < T_j$ implies that $T_j$ cannot start execution until
$T_i$ finishes due to the data dependency between
them. Thus, a parallel program can be represented
by a directed acyclic task graph [3]. Parallelism
exists among independent tasks: $T_i$ and $T_j$ are
said to be independent if neither $T_i < T_j$ nor
$T_j < T_i$. Each task $T_i$ is associated with a nominal
execution cost $\tau_i$, which is the execution time
required by $T_i$ on a reference machine in the
heterogeneous system. Similarly, a nominal
communication cost $c_{ij}$ is associated with the
message $M_{ij}$ from $T_i$ to $T_j$. Assume there are $e$
messages where $(n - 1) \le e < n^2$ so that the task
graph is a connected graph.

To model heterogeneity of the target system,
which consists of $m$ processors $\{P_1, P_2, \ldots, P_m\}$,
heterogeneity factors are used. For example, if a
task $T_i$ is scheduled to a processor $P_x$, then its
actual execution cost is given by $h_{ix}\tau_i$, where $h_{ix}$ is
the heterogeneity factor, which is determined by
measuring the difference in processing capabilities
(e.g., speed) of processor $P_x$ and the reference
machine with respect to task $T_i$. Similarly, if a
message $M_{ij}$ is scheduled to the communication
link $L_{xy}$ between processors $P_x$ and $P_y$, its actual
communication cost is given by $h'_{ijxy}c_{ij}$. An
example parallel program graph is shown in
Figure 2.

The start time and finish time of a message $M_{ij}$
from $T_i$ to $T_j$ on a communication link $L_{xy}$ are
denoted by $MST(M_{ij}, L_{xy})$ and $MFT(M_{ij}, L_{xy})$,
respectively. Obviously, we have:

$$MFT(M_{ij}, L_{xy}) = MST(M_{ij}, L_{xy}) + h'_{ijxy}c_{ij}$$

The start time of a task $T_i$ on processor $P_x$ is
denoted by $ST(T_i, P_x)$, which critically depends on
the task's data ready time (DRT). The DRT of a
task is defined as the latest arrival time of messages
from its predecessors. The finish time of a task $T_i$
is given by $FT(T_i, P_x) = ST(T_i, P_x) + h_{ix}\tau_i$. The
objective of scheduling is to minimize the
maximum $FT$, which is called the schedule length
(SL).
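
The cost model of this section can be traced with a small sketch. All function names and numbers below are hypothetical and chosen only for illustration; the sketch also assumes that each machine is idle when a task becomes ready, so that the start time equals the data ready time.

```python
def message_finish_time(mst, h_link, c):
    """MFT = MST + h' * c for one message on one link."""
    return mst + h_link * c


def data_ready_time(arrival_times):
    """Latest arrival among predecessor messages (0 for an entry task)."""
    return max(arrival_times, default=0.0)


def finish_time(st, h, tau):
    """FT = ST + h * tau, with h the heterogeneity factor of the machine."""
    return st + h * tau


def schedule_length(finish_times):
    """SL is the maximum finish time over all scheduled tasks."""
    return max(finish_times)


# Illustrative two-task chain T1 -> T2 (numbers are assumptions):
# T1 runs on the reference machine (h = 1.0), T2 on a machine twice as fast.
ft1 = finish_time(0.0, 1.0, 10.0)          # nominal cost 10 on the reference
mft = message_finish_time(ft1, 1.0, 5.0)   # nominal communication cost 5
st2 = data_ready_time([mft])               # T2 waits for the message
ft2 = finish_time(st2, 0.5, 20.0)          # nominal cost 20, h = 0.5
sl = schedule_length([ft1, ft2])
```

In this toy case the message arrives at time 15, so the second task finishes at 25, which is also the schedule length.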


2.2  Parallel  Code  Generation  with  Duplication 
Based  Scheduling 

The proposed duplication scheduler is designed
as a core module in the CASCH (Computer-Aided
SCHeduling) tool [2]. The system organization of
the CASCH tool is shown in Figure 3. It generates
a task graph from a sequential program, uses a
scheduling algorithm to perform scheduling, and
then generates the parallel code in a scheduled form
for a cluster of workstations. The timings for the
tasks and messages are assigned through a timing
database which was obtained through profiling of
the basic operations [2], [6]. As soon as the task
graph is generated, the duplication based scheduler
is invoked.


Figure  3:  The  organization  of  the  CASCH  tool. 


To  minimize  the  overall  execution  time  of  the 
application  on  the  cluster,  the  scheduler  first 
determines  which  tasks  are  more  critical  so  that 
they  need  to  be  scheduled  to  start  at  earlier  time 
slots,  possibly  by  duplicating  their  ancestors.  In  a 
task  graph,  the  critical-path  (CP),  which  consists  of 
tasks  forming  the  longest  path,  is  such  an  important 
structure  because  the  tasks  on  the  CP  potentially 
determine  the  overall  execution  time.  To  determine 
whether  a  task  is  a  CP  task,  we  can  use  two 
attributes:  t-level  (top  level)  and  b-level  (bottom 
level)  [13],  [24].  The  b-level  of  a  task  is  the  length 
of the longest path beginning with the task. The
t-level of a task is the length of the longest path
reaching the task. Thus, all tasks on the CP have the
same value of (t-level + b-level), which is equal to
the length of the CP. Based on this observation, we
can easily partition the tasks of the parallel program
into three categories: CP (critical path), IB
(in-branch), and OB (out-branch) tasks. The IB tasks
are ancestors of CP tasks but are not CP tasks
themselves. The OB tasks are neither CP nor IB tasks
and, as such, are relatively less important. This
partitioning can be performed in $O(e)$ time because
the t-level and b-level of all tasks can be computed
by using depth-first search. A task with a larger
b-level implies that
it  is  followed  by  a  longer  chain  of  tasks,  and  thus,  is 
given  a  higher  priority.  A  procedure  is  outlined 
below  for  constructing  a  scheduling  list  based  on 
the  partitioning. 
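
The level computation and the three-way partitioning can be sketched as follows. This is a minimal illustration under stated assumptions: nominal costs are used for both tasks and messages, the t-level excludes the task's own cost while the b-level includes it, and a topological order from the standard `graphlib` module replaces the depth-first search mentioned in the text. The function names and the five-task graph are hypothetical. When several critical paths exist, every task on any of them is marked CP here, whereas the procedure below selects a single path.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def levels(cost, edges):
    """cost: {task: nominal cost}; edges: {(u, v): nominal comm cost}."""
    preds = {t: [] for t in cost}
    succs = {t: [] for t in cost}
    for u, v in edges:
        preds[v].append(u)
        succs[u].append(v)
    order = list(TopologicalSorter(preds).static_order())
    t_level, b_level = {}, {}
    for v in order:  # predecessors are visited before v
        t_level[v] = max(
            (t_level[u] + cost[u] + edges[(u, v)] for u in preds[v]), default=0)
    for u in reversed(order):  # successors are visited before u
        b_level[u] = cost[u] + max(
            (edges[(u, v)] + b_level[v] for v in succs[u]), default=0)
    return t_level, b_level, preds


def partition(cost, edges):
    """Split the tasks into CP, IB and OB sets."""
    t_level, b_level, preds = levels(cost, edges)
    cp_len = max(t_level[t] + b_level[t] for t in cost)
    cp = {t for t in cost if t_level[t] + b_level[t] == cp_len}
    ib, stack = set(), [p for t in cp for p in preds[t]]
    while stack:  # walk backwards over all ancestors of CP tasks
        u = stack.pop()
        if u not in cp and u not in ib:
            ib.add(u)
            stack.extend(preds[u])
    return cp, ib, set(cost) - cp - ib


# Hypothetical five-task graph with integer costs.
cost = {"T1": 2, "T2": 3, "T3": 4, "T4": 1, "T5": 1}
edges = {("T1", "T2"): 1, ("T1", "T3"): 1, ("T2", "T4"): 1,
         ("T3", "T4"): 2, ("T3", "T5"): 1}
t_level, b_level, _ = levels(cost, edges)
cp, ib, ob = partition(cost, edges)
```

For this graph the critical path is T1, T3, T4 with length 10; T2 is an in-branch task because it feeds T4, and T5 is an out-branch task.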

Algorithm 1: Construction of Scheduling List

Input: a program task graph with $n$ tasks
$\{T_1, T_2, \ldots, T_n\}$

Output: a serial order of the tasks

1. compute the t-level and b-level of each task
   by using depth-first search;

2. identify the CP; if there are multiple CPs,
   select the one with the largest sum of
   execution cost and ties are broken
   randomly;

3. put the CP task which does not have any
   predecessor to the first position of the
   serial order;

4. $i \leftarrow 2$; $T_x \leftarrow$ the next CP task;

5. while not all the CP tasks are included do

6.   if $T_x$ has all its predecessors in the serial
     order then

7.     put $T_x$ at position $i$ and increment $i$;

8.   else let $T_y$ be the predecessor of $T_x$
     which is not in the serial order and has the
     largest b-level (ties are broken by choosing
     the predecessor with a smaller t-level);

9.     if $T_y$ has all its predecessors in the
       serial order then put $T_y$ at position $i$ and
       increment $i$; otherwise, recursively
       include all the ancestors of $T_y$ in the serial
       order such that the tasks with a larger
       b-level are included first;

10.    repeat the above step until all the
       predecessors of $T_x$ are in the serial order;

11.    put $T_x$ at position $i$ and increment $i$;

12.  $T_x \leftarrow$ the next CP task;

13. append all the OB tasks to the serial order
    in descending order of b-level;
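
One possible reading of Algorithm 1 is sketched below. It is an illustration rather than the CASCH implementation: the function name, the argument layout, and the five-task example (with precomputed levels) are assumptions, and the recursion in steps 8 to 10 is interpreted as "place the missing predecessors first, largest b-level first, smaller t-level on ties".

```python
def scheduling_list(cp_path, preds, t_level, b_level):
    """Return a serial order: CP tasks with their ancestors, then OB tasks."""
    order, placed = [], set()

    def place(task):
        if task in placed:
            return
        # Missing predecessors go first: larger b-level, then smaller t-level.
        for p in sorted(preds[task], key=lambda u: (-b_level[u], t_level[u])):
            place(p)
        order.append(task)
        placed.add(task)

    for task in cp_path:  # steps 3 to 12
        place(task)
    # Step 13: remaining (OB) tasks in descending order of b-level.
    rest = sorted((t for t in preds if t not in placed),
                  key=lambda u: -b_level[u])
    order.extend(rest)
    return order


# Hypothetical graph: T1 -> {T2, T3}, {T2, T3} -> T4, T3 -> T5.
preds = {"T1": [], "T2": ["T1"], "T3": ["T1"],
         "T4": ["T2", "T3"], "T5": ["T3"]}
t_level = {"T1": 0, "T2": 3, "T3": 3, "T4": 9, "T5": 8}
b_level = {"T1": 10, "T2": 5, "T3": 7, "T4": 1, "T5": 1}
order = scheduling_list(["T1", "T3", "T4"], preds, t_level, b_level)
```

Here the in-branch task T2 is pulled in just before the CP task T4 that depends on it, and the out-branch task T5 is appended last.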

Using  the  above  scheduling  list,  we  can 
determine  which  tasks  have  to  be  considered  first  in 
the  duplication  process.  During  scheduling,  the  CP 
tasks  are  always  considered  first.  However,  we 
cannot  attempt  to  schedule  the  CP  tasks  unless  all  of 
their  ancestor  tasks,  which  need  not  be  CP  tasks 
themselves,  are  scheduled.  Thus,  we  use  a  recursive 
approach.  For  each  CP  task,  we  first  recursively 
check  whether  its  ancestors  are  scheduled.  If  not, 
then  the  candidate  for  scheduling  will  be  changed  to 
the  unscheduled  ancestor  which  is  at  the  earliest 
position  on  the  scheduling  list.  To  actually  schedule 
a  task,  we  try  to  minimize  its  finish  time  by 
attempting  to  schedule  it  to  the  fastest  machine. 
Duplication  is  employed  for  the  minimization  of 
finish  times  in  that  as  many  ancestors  as  possible 
are inserted before the task. The duplication process
will stop when the finish time of the task starts to
increase  or  the  time  slot  has  been  used  up.  The  order 
of  selecting  ancestors  for  duplication  is  governed 
by  the  scheduling  list.  The  heterogeneity  factors  hix 
are  also  used  for  determining  the  finish  times.  After 
all  the  CP  tasks  are  scheduled  (and  hence  all  the  IB 
tasks),  the  OB  tasks  are  considered  for  scheduling. 
To  avoid  using  an  excessive  number  of  machines, 
we  attempt  to  schedule  the  OB  tasks  without  using 
duplication.  This  is  useful  because  the  OB  tasks 
usually  do  not  affect  the  overall  completion  time 
and,  thus,  need  not  be  scheduled  to  finish  as  soon  as 
possible.  However,  if  such  a  conservative  approach 
fails — that  is,  the  overall  completion  time  is 
increased  by  scheduling  a  certain  OB  task  without 
using  duplication,  then  the  same  recursive 
duplication  process  will  be  applied  to  the  OB  task. 
The  whole  duplication  based  task  scheduling 
process is summarized in Algorithm 2 below.

Algorithm 2: Heterogeneous Duplication
Based Scheduling

Input: a program task graph with $n$ tasks
$\{T_1, T_2, \ldots, T_n\}$, a heterogeneous system
with $m$ machines $\{P_1, P_2, \ldots, P_m\}$, and the
relative speeds of the machines;

Output: a duplication based schedule

1. Construct the scheduling list (use
   Algorithm 1);

2. For each CP task, first recursively schedule
   each of its unscheduled ancestor IB tasks
   to a machine so that they can finish as soon
   as possible by trying to duplicate on the
   machine as many ancestors as the time slot
   allows (use the heterogeneity factors $h_{ix}$
   for determining the finish times); the order
   of selecting tasks for duplication is
   governed by the scheduling list; finally
   apply the same recursive duplication
   process to the CP task itself;

3. Without using any duplication, schedule
   each of the remaining tasks (i.e., OB tasks)
   to the fastest machine provided that the
   schedule length does not increase; if this
   fails, employ the recursive duplication
   technique to schedule the OB task;
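
The core decision inside the duplication step can be illustrated with a deliberately reduced sketch. It covers only the simplest case, assumed here for illustration: a task with a single parent that is an entry task, so that a duplicate of the parent can start as soon as the target machine is free. The function names and numbers are hypothetical; the full algorithm applies this comparison recursively over several ancestors.

```python
def start_if_message(parent_finish, comm_cost, machine_free):
    """Child waits for the parent's message from another machine."""
    return max(machine_free, parent_finish + comm_cost)


def start_if_duplicated(parent_nominal_cost, h, machine_free):
    """Parent is re-executed locally, so no message is needed."""
    return machine_free + h * parent_nominal_cost


def choose(parent_finish, comm_cost, parent_nominal_cost, h, machine_free=0.0):
    """Duplicate only when it lets the child start strictly earlier."""
    remote = start_if_message(parent_finish, comm_cost, machine_free)
    local = start_if_duplicated(parent_nominal_cost, h, machine_free)
    return ("duplicate", local) if local < remote else ("message", remote)


# Fast target machine (h = 0.5) and an expensive message: duplication wins.
fast_case = choose(parent_finish=10, comm_cost=8, parent_nominal_cost=10, h=0.5)
# Slow target machine (h = 2.0) and a cheap message: waiting is better.
slow_case = choose(parent_finish=10, comm_cost=1, parent_nominal_cost=10, h=2.0)
```

The second case mirrors the stopping rule in the text: duplication is not applied once it would increase the start (and hence finish) time of the task.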

To illustrate how the heterogeneity of the
machines is exploited, consider in Figure 4 the two
schedules of the Gaussian elimination task graph
(shown earlier in Figure 2). The schedule on the left
is the best schedule without duplication using
homogeneous machines. On the right is a schedule
using six heterogeneous machines in which $P_2$ is of
the same speed as the machines in the left schedule,
while $P_0$ and $P_1$ are two times and 1.3 times faster
than $P_2$, respectively. The remaining machines are
slower than $P_2$. We can see that the CP of the task
graph is scheduled to the fastest machine $P_0$. The
critical IB tasks, $T_4$ and $T_5$, are also scheduled to
finish as early as possible on fast machines $P_1$ and
$P_2$, respectively, by duplicating $T_1$. The resulting
schedule has an overall completion time of 182
units, which is significantly smaller than that of the
homogeneous schedule without duplication (330
units)†. Due to space limitations, detailed steps of
producing the two schedules are not shown.

After  a  symbolic  schedule  is  generated,  the  code 
generator  is  invoked  to  actually  implement  the 
schedule  using  the  SPMD  (Single  Program 
Multiple  Data)  model  [2],  [23].  The  program 
statements or procedures constituting a task $T_i$ are
allocated to the specified machine $P_j$ for execution
using  conditional  statements  checking  the  ID  of  the 
machine,  as  shown  in  Figure  5.  Data  structures 
associated  with  a  task  are  also  replicated.  The 
output  of  the  code  generator  is  a  C  program  in 
which  MPI  communication  primitives  are  inserted. 
The  resulting  parallel  program  is  then  compiled  and 
executed  on  the  cluster  of  workstations. 

† In the homogeneous case, the scheduler is also given six
machines. However, to arrive at the best schedule
shown, it needs only three.

3 Performance Results

We have implemented the duplication based
scheduling algorithm in the code generator module
of  the  CASCH  tool  (see  Figure  3),  which  is 
executable  on  a  Linux-based  Pentium  II  PC  in  our 
cluster.  We  have  parallelized  several  numerical 
applications  on  CASCH.  Here,  we  present  and 
discuss  some  preliminary  results  obtained  by 
measuring  the  execution  time  of  three  applications: 
Gaussian  elimination,  Gauss-Seidel  algorithm,  and 
N-Body problem. By varying the problem sizes
(i.e., the dimensions of the matrices in these
applications, from 32 to 256) and the granularities
(from 1-column block to 8-column block, using
1-D decomposition), we generated four task graphs
for each application with roughly 100, 200, 400,
and 800 tasks.

Our heterogeneous cluster consists of twelve
PCs: eight Pentium II 333 MHz with 32 MB
memory and four Pentium II 450 MHz with 64 MB
memory. The PCs are connected by a Fore Fast
Ethernet switch. All the experiments were
performed using at most eight PCs in three different
configurations: (1) eight homogeneous machines
(i.e., all are Pentium II 333 MHz); (2) five Pentium
II 333 MHz plus two Pentium II 450 MHz; (3) two
Pentium II 333 MHz plus four Pentium II 450
MHz. The aggregate computing power of the three
configurations is approximately the same because
we found that a Pentium II 450 MHz is about 1.5
times faster than a Pentium II 333 MHz. The
rationale behind selecting these configurations is
that we wanted to investigate the benefit of
heterogeneity. These configurations are denoted as
8S (for eight slow machines), 2F+5S (two fast plus
five slow machines), and 4F+2S (four fast plus two
slow machines), respectively. Ten different runs for
each size of the three applications were done and
the average application execution times were noted.
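
The claim that the three configurations have approximately the same aggregate computing power can be checked with a short calculation. The sketch below uses the reported factor of about 1.5 and expresses power in units of one slow (333 MHz) machine; the function and variable names are illustrative only.

```python
FAST_SPEEDUP = 1.5  # reported speed of a 450 MHz PC relative to a 333 MHz PC


def aggregate_power(fast, slow, speedup=FAST_SPEEDUP):
    """Total power in units of one slow machine."""
    return fast * speedup + slow


# (number of fast machines, number of slow machines) per configuration
configs = {"8S": (0, 8), "2F+5S": (2, 5), "4F+2S": (4, 2)}
power = {name: aggregate_power(f, s) for name, (f, s) in configs.items()}
```

Each configuration evaluates to 8 slow-machine equivalents, even though the number of machines drops from eight to seven to six.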

Figure 4: The effect of heterogeneity.

if (mynode() == j) {
    /* execution of task i */
}

Figure 5: SPMD implementation of a schedule.

These average execution times of the three
applications are shown in Figure 6. As can be seen,
heterogeneity has a significant impact on the
overall  execution  time  of  an  application  in  that 
using more fast machines (although the total number of
machines is smaller), in general, can speed up the
application  considerably.  The  improvement  in  the 
Gaussian  elimination  application  is  the  most 
remarkable.  This  can  be  explained  by  the  fact  that 
the  Gaussian  elimination  graph  has  a  distinctive 
critical-path  (see  Figure  2),  the  tasks  on  which  can 
be  scheduled  to  the  fastest  machine.  On  the  other 
hand,  as  the  Gauss-Seidel  task  graph  has  many 
intersecting  critical-paths  [23],  the  duplication 
approach  is  less  effective  in  exploiting  the 
advantage  of  heterogeneity.  The  improvement  of 
the  heterogeneous  approaches  for  the  N-Body 
problem,  which  has  a  slightly  less  regular  task 
graph  structure  [23],  is  also  considerable. 

4  Conclusions  and  Future  Work 

In  this  paper,  we  have  presented  a  duplication 
based  approach  in  scheduling  tasks  to  a 
heterogeneous  cluster  of  PCs.  The  scheduling 
algorithm  works  by  recursively  duplicating  critical 
tasks  to  the  faster  machines  in  order  to  minimize  the 
finish  times.  The  algorithm  has  been  implemented 
in  our  prototype  program  parallelization  tool  for 
generating  MPI  code  executable  on  a  cluster  of 
Pentium  PCs.  Our  experiments  using  three 
numerical  applications  have  indicated  that 
heterogeneity of a PC cluster is indeed useful for
optimizing the execution of parallel programs. One
important issue related to using a PC cluster is fault
tolerance. Unlike a tightly coupled parallel
architecture  (e.g.,  the  IBM  SP2),  a  PC  in  a  cluster 
may  experience  intermittent  failure,  possibly  due  to 
user  reboots.  Thus,  the  task  schedule  has  to  be  fault- 
tolerant  so  that  the  application  can  finish  its 
execution  even  in  the  presence  of  such  faults.  We 
believe  that  task  duplication,  augmented  with 
check-pointing  and  roll-back  recovery  techniques, 
is  a  viable  approach  to  achieve  this  goal.  A 
performance  model  is  being  developed  to 
quantitatively  analyze  the  merits  of  this  approach. 


Abstract

The Min-min algorithm is a simple algorithm. It runs
fast and delivers good performance. However, the Min-min
algorithm schedules small tasks first, resulting in some load
imbalance. In this paper, we present an algorithm which
improves the Min-min algorithm by scheduling large tasks
first. The new algorithm, Segmented min-min, balances the
load well and demonstrates even better performance in both
makespan and running time.


1.  Introduction 

A heterogeneous computing environment utilizes a suite
of different machines interconnected by high-speed
networks to execute different computationally intensive
applications that have diverse computational requirements
[8, 12, 13]. The general problem of mapping tasks to
machines has been shown to be NP-complete [10]. Many
useful heuristics to perform this mapping function have
been developed. Among many sophisticated algorithms, the
Min-min algorithm [10] is a simple algorithm which runs
fast and delivers satisfactory performance. It selects from
all tasks the task that minimizes the completion time on a
machine. In most situations, it maps as many tasks as
possible to their first choice of machine. However, the
Min-min algorithm is unable to balance the load well since
it usually schedules small tasks first. In this paper, we
propose a simple alternative to the Min-min algorithm that
schedules large tasks first. The proposed algorithm retains
the advantage of the Min-min algorithm and achieves good
load balance at the same time.

This paper presents the new algorithm, named the
Segmented min-min algorithm. In Section 2, previous
heuristic algorithms are reviewed. Section 3 presents the
new algorithm. Section 4 exhibits the simulation model and
experimental results. Section 5 concludes the paper.

2.  Previous  Heuristics 

In this section, we review a set of heuristic algorithms
which schedule meta-tasks to heterogeneous computing
systems. A meta-task is defined as a collection of
independent tasks with no data dependences. Meta-tasks are
mapped onto machines statically; each machine executes a
single task at a time. For static mapping, it is assumed that
the number of tasks, $t$, and the number of machines, $m$, are
known a priori.

A large number of heuristic algorithms have been
designed to schedule tasks to machines on heterogeneous
computing systems. In [2], eleven commonly used
algorithms have been evaluated, listed as follows.

OLB: Opportunistic Load Balancing (OLB) assigns each
task, in arbitrary order, to the next available
machine [1, 7, 8].

UDA: User-Directed Assignment (UDA) assigns each
task, in arbitrary order, to the machine with the best
expected execution time for the task [1, 7].

Fast Greedy: Fast Greedy assigns each task, in arbitrary
order, to the machine with the minimum completion
time for that task [1].

Min-min: In Min-min, the minimum completion time for
each task is computed with respect to all machines. The
task with the overall minimum completion time is selected
and assigned to the corresponding machine. The newly
mapped task is removed, and the process repeats until
all tasks are mapped [1, 7, 10].

Max-min: The Max-min heuristic is very similar to the
Min-min algorithm. The set of minimum completion
times is calculated for every task. The task with the overall
maximum completion time from the set is selected and
assigned to the corresponding machine [1, 7, 10].

Greedy: The Greedy heuristic is literally a combination
of the Min-min and Max-min heuristics by using the
better solution [1, 7].

GA: The Genetic algorithm (GA) is used for searching
a large solution space. It operates on a population of
chromosomes for a given problem. The initial
population is generated randomly. A chromosome could be
generated by any other heuristic algorithm. When it is
generated by Min-min, it is called “seeding” the
population with Min-min [14, 15].

SA: Simulated Annealing (SA) is an iterative technique
that considers only one possible solution for each
meta-task at a time. SA uses a procedure that
probabilistically allows poorer solutions to be accepted,
in an attempt to obtain a better search of the solution
space, based on a system temperature [5, 11].

GSA: The Genetic Simulated Annealing (GSA) heuristic
is a combination of the GA and SA techniques [3].

Tabu: Tabu search is a solution space search that keeps
track of the regions of the solution space which have
already been searched so as not to repeat a search near
these areas [6, 9].

A*: A* is a tree search beginning at a root node that is
usually a null solution. As the tree grows, intermediate
nodes represent partial solutions and leaf nodes
represent final solutions. Each node has a cost function, and
the node with the minimum cost function is replaced
by its children. Any time a node is added, the tree is
pruned by deleting the node with the largest cost
function. This process continues until a complete mapping
(a leaf node) is reached [4].
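
The Min-min and Max-min descriptions in the list above can be sketched in a few lines. This is an illustrative implementation, not code from [2]: the function name, the `max_variant` flag, the index-based tie-breaking, and the small ETC matrix are assumptions. `etc[i][j]` is the expected time to compute task `i` on machine `j`.

```python
def min_min(etc, max_variant=False):
    """Map independent tasks to machines; returns (assignment, makespan)."""
    m = len(etc[0])
    ready = [0.0] * m                # time at which each machine becomes free
    unmapped = set(range(len(etc)))
    assignment = {}
    while unmapped:
        # Minimum completion time (and its machine) for every unmapped task.
        best = {}
        for i in unmapped:
            j = min(range(m), key=lambda k: ready[k] + etc[i][k])
            best[i] = (ready[j] + etc[i][j], j)
        # Min-min takes the smallest of these minima, Max-min the largest.
        chooser = max if max_variant else min
        i = chooser(best, key=lambda t: (best[t][0], t))
        completion, j = best[i]
        assignment[i] = j
        ready[j] = completion
        unmapped.remove(i)
    return assignment, max(ready)


# Hypothetical ETC matrix: two small tasks and one large task, two machines.
etc = [[2, 3], [3, 4], [10, 11]]
mm = min_min(etc)
xm = min_min(etc, max_variant=True)
```

On this small instance Min-min maps the large task last and ends with a makespan of 12, while Max-min places it first and reaches 10, which illustrates the load imbalance discussed in the text.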

The experimental results from [2] show that OLB, UDA,
Max-min, SA, GSA, and Tabu do not produce good
schedules in general. Min-min, GA, and A* are able to deliver
good performance. The difference between the completion
times of the schedules (makespans) generated by these three
algorithms is within 10%. GA is consistently better than
Min-min by a few percent, since it seeds the
population with a Min-min chromosome. A*, on the other hand,
produces better or worse schedules than Min-min and GA
in different situations. Among the three algorithms,
Min-min is the fastest algorithm, GA is much slower, and A* is
very slow. For 512 tasks and 16 machines, the running time
of Min-min is about 1 second, GA 30 seconds, and A* 1200
seconds [2].

Min-min is a simple algorithm, fast, and able to deliver
good performance. Even GA has to be “seeded” with
a Min-min chromosome to obtain its good
performance. Min-min schedules the “best case” tasks first and
generates relatively good schedules. The drawback of
Min-min is that it assigns small tasks first. Thus, the smaller
tasks would execute first and then a few larger tasks execute
while several machines sit idle, resulting in poor machine
utilization. We propose a simple method to force large
tasks to be scheduled first. Tasks are partitioned into
segments according to their execution times. The segment with
larger tasks is scheduled first with the Min-min algorithm
being applied within the segment. This is called Segmented
min-min (Smm).

3.  The  Segmented  Min-Min  Algorithm 

Every task has an ETC (expected time to compute) on a
specific machine. If there are $t$ tasks and $m$ machines, we
can obtain a $t \times m$ ETC matrix. $ETC(i,j)$ is the estimated
execution time for task $i$ on machine $j$.

The Segmented min-min algorithm sorts the tasks
according to ETCs. The tasks can be sorted into an ordered list by
the average ETC, the minimum ETC, or the maximum ETC.
Then, the task list is partitioned into segments of equal
size. The segment of larger tasks is scheduled first and the
segment of smaller tasks last. For each segment, Min-min
is applied to assign tasks to machines. The algorithm is
described as follows.


Segmented min-min (Smm)

1. Compute the sorting key for each task:

   Sub-policy 1 (Smm-avg): Compute the average
   value of each row in the ETC matrix

   $$key_i = \sum_j ETC(i,j)/m.$$

   Sub-policy 2 (Smm-min): Compute the minimum
   value of each row in the ETC matrix

   $$key_i = \min_j ETC(i,j).$$

   Sub-policy 3 (Smm-max): Compute the maximum
   value of each row in the ETC matrix

   $$key_i = \max_j ETC(i,j).$$

2. Sort the tasks into a task list in decreasing order of
   their keys.

3. Partition the tasks evenly into $N$ segments.

4. Schedule each segment in order by applying Min-min.
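
The four steps above can be sketched as follows. This is a minimal illustration under assumptions: the function names, the `policy` strings, the way segment boundaries are rounded, and the test matrix are hypothetical and not taken from the paper's implementation. Machine ready times are carried over from one segment to the next.

```python
def min_min_segment(etc, tasks, ready, assignment):
    """Apply Min-min to one segment, updating ready times in place."""
    unmapped = set(tasks)
    m = len(ready)
    while unmapped:
        best = {}
        for i in unmapped:
            j = min(range(m), key=lambda k: ready[k] + etc[i][k])
            best[i] = (ready[j] + etc[i][j], j)
        i = min(best, key=lambda t: (best[t][0], t))
        completion, j = best[i]
        assignment[i] = j
        ready[j] = completion
        unmapped.remove(i)


# Step 1: sorting keys for the three sub-policies.
KEYS = {"avg": lambda row: sum(row) / len(row), "min": min, "max": max}


def segmented_min_min(etc, n_segments=4, policy="avg"):
    key = KEYS[policy]
    # Step 2: sort tasks in decreasing order of their keys.
    order = sorted(range(len(etc)), key=lambda i: key(etc[i]), reverse=True)
    t = len(order)
    ready = [0.0] * len(etc[0])
    assignment = {}
    for s in range(n_segments):
        # Step 3: segment s covers an (almost) equal share of the list.
        segment = order[s * t // n_segments:(s + 1) * t // n_segments]
        # Step 4: Min-min within the segment, larger tasks first overall.
        min_min_segment(etc, segment, ready, assignment)
    return assignment, max(ready)


# Hypothetical ETC matrix: two small tasks and one large task, two machines.
etc = [[2, 3], [3, 4], [10, 11]]
plain = segmented_min_min(etc, n_segments=1)      # reduces to Min-min
segmented = segmented_min_min(etc, n_segments=3)  # one task per segment
```

With a single segment the procedure is identical to Min-min (makespan 12 on this instance); with three segments the large task is mapped first and the makespan drops to 10.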

Unlike the Min-min algorithm, Segmented min-min sorts the tasks before scheduling. Sorting means that larger tasks are promoted and scheduled earlier. Min-min is then applied locally within each segment. The problem here is how to define the sorting key. Tasks with long execution times deserve promotion to early scheduling. However, in a heterogeneous system, the execution time of a task varies across machines. Therefore, we test three sub-policies that define the execution time of a task as the average, the minimum, or the maximum of its ETCs.
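
To see why the choice of key matters, the short sketch below computes the three keys for a small, invented ETC matrix and shows that each sub-policy can produce a different scheduling order. The helper names `sorting_keys` and `schedule_order` and the matrix values are assumptions made for illustration only.

```python
def sorting_keys(etc):
    """Per-task keys under the three sub-policies (one value per ETC row)."""
    return {
        "Smm-avg": [sum(row) / len(row) for row in etc],
        "Smm-min": [min(row) for row in etc],
        "Smm-max": [max(row) for row in etc],
    }


def schedule_order(keys):
    """Task indices in decreasing key order (largest task first)."""
    return sorted(range(len(keys)), key=lambda i: keys[i], reverse=True)


# Invented 3-task x 2-machine ETC matrix with strongly machine-dependent times.
etc = [[4, 20], [9, 10], [11, 14]]
orders = {name: schedule_order(keys) for name, keys in sorting_keys(etc).items()}
for name, order in orders.items():
    print(name, order)
# Smm-avg [2, 0, 1]
# Smm-min [2, 1, 0]
# Smm-max [0, 2, 1]
```

Task 0 is the "largest" task under the maximum key but the "smallest" under the minimum key, because its execution time differs greatly between the two machines.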

The third step of the Segmented min-min algorithm partitions the tasks into N segments. Determining the optimal value of N is a trade-off. More segments result in better load balance. On the other hand, too many segments lose the advantages of the Min-min algorithm. Intuitively, as long as we partition the tasks into a few segments, such as large, medium, and small tasks, the load can be balanced fairly well. Experimental results confirm this, as shown in Figure 1, where the curves show the improvement of Smm-avg over Min-min for different values of N. Each point on these curves is the average of five runs. In general, the optimal value of N is related to the ratio c = t/m. When c is large, Min-min performs well. For small c, which means the number of tasks per machine is not large, the optimal value of N is about 4 or 5. Therefore, we fix the value of N at 4, which means that we always partition the tasks into four segments.
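
The two ends of this trade-off can be made concrete with a small sketch of the partition step. The helper `partition` and its integer-boundary splitting rule are assumptions for illustration; the source only says the tasks are partitioned evenly. With N = 1 the whole list is one segment, so the procedure reduces to plain Min-min; with N = t every segment holds a single task, so the tasks are mapped strictly in sorted order and Min-min has no choice left within a segment.

```python
def partition(order, n_segments):
    """Split an ordered task list into n_segments contiguous, near-equal parts."""
    t = len(order)
    return [
        order[k * t // n_segments:(k + 1) * t // n_segments]
        for k in range(n_segments)
    ]


t, m = 512, 16                 # sizes used in the experiments of this paper
c = t / m                      # tasks per machine
order = list(range(t))         # stand-in for the sorted task list

sizes_n4 = [len(s) for s in partition(order, 4)]
print(c, sizes_n4)             # 32.0 [128, 128, 128, 128]

one_segment = partition(order, 1)   # N = 1: plain Min-min over all tasks
singletons = partition(order, t)    # N = t: one task per segment
```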


Figure 1. The N Value. (Plot of the improvement of Smm-avg over Min-min for different values of N.)


4.  Experiments 

4.1.  Performance  Comparison 

For the experimental studies, we use the same method as in [2] to generate the test set. The parameters include Consistent, Inconsistent, or Semi-Consistent; High or Low Task Heterogeneity; and High or Low Machine Heterogeneity. For details, see [2]. All experimental results are based on 512 tasks, 16 or 32 machines, 100 trials, and N = 4. The results for 16 machines are shown in Tables I to XII and those for 32 machines in Tables XIII to XXIV. In these tables, the second column shows the utilization of the machines, which is defined as 1 - [expression illegible in the source]. The third column is the makespan (the completion time) of the schedules. The fourth column is the improvement of each Segmented min-min algorithm over the Max-min algorithm and the fifth column is that over the Min-min algorithm. The last column shows the running time of each algorithm.
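
The text does not spell out how the improvement columns are computed. The sketch below assumes the improvement is the makespan reduction relative to the Segmented min-min makespan, (baseline - Smm) / Smm; this formula is an inference, chosen because it reproduces the Smm-avg row of Table IX from the three makespans printed there. The function name `improvement` is hypothetical.

```python
def improvement(baseline_makespan: float, smm_makespan: float) -> float:
    """Relative improvement of Smm over a baseline (assumed definition)."""
    return (baseline_makespan - smm_makespan) / smm_makespan


# Makespans from Table IX (x10^3 sec): Max-min, Min-min, Smm-avg.
max_min, min_min_ms, smm_avg = 6.339, 3.745, 3.595

over_max_min = 100 * improvement(max_min, smm_avg)
over_min_min = 100 * improvement(min_min_ms, smm_avg)
print(f"{over_max_min:.1f}% {over_min_min:.1f}%")  # 76.3% 4.2%
```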

4.2.  Discussion 

From these results, we found that the Segmented min-min algorithm balances the load very well compared with the Max-min and the Min-min algorithms. The system utilization of Min-min is relatively low while that of Segmented min-min is very high. This is because Segmented min-min schedules larger tasks first, so that smaller tasks can run in parallel with large tasks. Although the Max-min algorithm produces very good load balancing, it does not schedule tasks to their "best case." Thus, its performance is far worse than that of the Segmented min-min algorithm. Higher system utilization makes the three Segmented min-min algorithms better than Min-min in almost all cases. Smm-avg improves on the performance of Min-min by 2% to 12%. Smm-min shows better performance than Smm-avg in some cases but is worse than Smm-avg in most cases. Smm-max is worse than Smm-avg in almost all cases. Thus, we use Smm-avg for the Segmented min-min algorithm, which improves on the Min-min algorithm by 6.1% on average.

In addition, the running time of the Segmented min-min algorithm is much less than that of Min-min. This is not difficult to explain: Min-min spends a large amount of time searching the entire matrix each time it maps one task, while Segmented min-min, taking advantage of the divide-and-conquer strategy, only searches for the minimum value within a single partition. In summary, this partitioning method improves the makespan and the running time simultaneously.
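
The running-time argument can be made slightly more concrete with a rough operation count. The estimate below assumes a straightforward implementation that rescans every remaining (task, machine) entry for each mapping decision; the paper does not describe its implementation, so the symbols $T_{\text{Min-min}}$ and $T_{\text{Smm}}$ and the constants are illustrative only.

```latex
% Min-min on t tasks and m machines: when k tasks remain, one decision scans k*m entries.
T_{\text{Min-min}}(t, m) \approx \sum_{k=1}^{t} k\,m = \frac{m\,t\,(t+1)}{2} = O(t^{2} m)

% Segmented min-min: N separate Min-min runs on t/N tasks each, plus key computation and sorting.
T_{\text{Smm}}(t, m, N) \approx N \cdot \frac{m\,(t/N)\,(t/N + 1)}{2} + O(t\,m + t \log t)
  \approx \frac{1}{N}\, T_{\text{Min-min}}(t, m)
```

Under these assumptions, N = 4 would give at most about a fourfold reduction. The reported running times are of that order but below the ideal factor: roughly 1.07 s against 0.33 s for 16 machines and 2.18 s against 1.17 s for 32 machines.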

Table I. 16 Machines, Inconsistent, Low Task, Low Machine Heterogeneity

Algorithm   System        Makespan       Improvement    Improvement    Running
            Utilization   (x10^3 Sec.)   over Max-min   over Min-min   Time (Sec.)
Max-min     [illegible]   5.425          -              -              1.19
Min-min     [illegible]   2.915          -              -              [illegible]
Smm-avg     [illegible]   2.767          96.0%          5.3%           [illegible]
Smm-min     [illegible]   2.746          96.3%          6.1%           [illegible]
Smm-max     97.8%         2.784          94.9%          4.7%           [illegible]

Table II. 16 Machines, Inconsistent, Low Task, High Machine Heterogeneity

Algorithm   System        Makespan       Improvement    Improvement    Running
            Utilization   (x10^5 Sec.)   over Max-min   over Min-min   Time (Sec.)
Max-min     [illegible]   2.513          -              -              1.19
Min-min     [illegible]   1.214          -              -              1.06
Smm-avg     [illegible]   1.113          125.8%         9.1%           [illegible]
Smm-min     98.2%         1.064          136.2%         14.2%          0.33
Smm-max     [illegible]   1.135          121.4%         7.0%           0.33

Table III. 16 Machines, Inconsistent, High Task, Low Machine Heterogeneity

Algorithm   System        Makespan       Improvement    Improvement    Running
            Utilization   (x10^4 Sec.)   over Max-min   over Min-min   Time (Sec.)
Max-min     [illegible]   15.943         -              -              1.20
Min-min     [illegible]   8.588          -              -              1.07
Smm-avg     [illegible]   8.139          95.9%          5.5%           0.33
Smm-min     [illegible]   8.087          97.1%          6.2%           0.33
Smm-max     97.9%         8.190          94.7%          4.8%           0.33

Table IV. 16 Machines, Inconsistent, High Task, High Machine Heterogeneity

Algorithm   System        Makespan       Improvement    Improvement    Running
            Utilization   (x10^6 Sec.)   over Max-min   over Min-min   Time (Sec.)
Max-min     99.8%         7.375          -              -              1.20
Min-min     [illegible]   3.573          -              -              1.07
Smm-avg     [illegible]   3.279          124.9%         8.9%           0.33
Smm-min     [illegible]   3.131          135.5%         14.1%          0.33
Smm-max     95.9%         3.344          125.5%         6.9%           0.33

Table V. 16 Machines, Consistent, Low Task, Low Machine Heterogeneity

Algorithm   System        Makespan       Improvement    Improvement    Running
            Utilization   (x10^3 Sec.)   over Max-min   over Min-min   Time (Sec.)
Max-min     99.9%         7.415          -              -              1.22
Min-min     [illegible]   [illegible]    -              -              1.07
Smm-avg     [illegible]   [illegible]    [illegible]    2.7%           0.33
Smm-min     [illegible]   [illegible]    27.6%          [illegible]    0.33
Smm-max     98.4%         5.749          29.0%          1.9%           0.33

Table VI. 16 Machines, Consistent, Low Task, High Machine Heterogeneity

Algorithm   System        Makespan       Improvement    Improvement    Running
            Utilization   (x10^5 Sec.)   over Max-min   over Min-min   Time (Sec.)
Max-min     99.9%         4.125          -              -              1.23
Min-min     [illegible]   2.866          -              -              1.07
Smm-avg     [illegible]   2.805          47.1%          2.1%           [illegible]
Smm-min     [illegible]   [illegible]    42.3%          -2.0%          [illegible]
Smm-max     [illegible]   2.867          43.9%          [illegible]    0.33

Table VII. 16 Machines, Consistent, High Task, Low Machine Heterogeneity

Algorithm   System        Makespan       Improvement    Improvement    Running
            Utilization   (x10^5 Sec.)   over Max-min   over Min-min   Time (Sec.)
Max-min     100.0%        2.181          -              -              1.24
Min-min     93.9%         1.725          -              -              1.08
Smm-avg     [illegible]   1.679          29.9%          2.8%           0.33
Smm-min     [illegible]   [illegible]    27.5%          [illegible]    0.33
Smm-max     [illegible]   1.693          28.8%          1.9%           0.33

Table VIII. 16 Machines, Consistent, High Task, High Machine Heterogeneity

Algorithm   System        Makespan       Improvement    Improvement    Running
            Utilization   (x10^6 Sec.)   over Max-min   over Min-min   Time (Sec.)
Max-min     99.9%         [illegible]    -              -              1.24
Min-min     88.9%         8.437          -              -              1.07
Smm-avg     97.7%         8.258          47.2%          2.2%           0.33
Smm-min     96.7%         8.564          41.9%          -1.5%          0.33
Smm-max     97.4%         8.430          44.2%          0.0%           0.33

Table IX. 16 Machines, Semi-Consistent, Low Task, Low Machine Heterogeneity

Algorithm   System        Makespan       Improvement    Improvement    Running
            Utilization   (x10^3 Sec.)   over Max-min   over Min-min   Time (Sec.)
Max-min     99.9%         6.339          -              -              1.21
Min-min     91.8%         3.745          -              -              1.07
Smm-avg     98.2%         3.595          76.3%          4.2%           0.33
Smm-min     98.1%         [illegible]    74.9%          3.3%           0.33
Smm-max     98.0%         3.624          74.9%          3.3%           0.33

Table X. 16 Machines, Semi-Consistent, Low Task, High Machine Heterogeneity

Algorithm   System        Makespan       Improvement    Improvement    Running
            Utilization   (x10^5 Sec.)   over Max-min   over Min-min   Time (Sec.)
Max-min     99.8%         3.199          -              -              1.21
Min-min     [illegible]   1.664          -              -              1.07
Smm-avg     [illegible]   1.569          103.9%         6.1%           0.33
Smm-min     96.5%         1.593          100.8%         4.5%           0.33
Smm-max     96.3%         1.590          101.2%         4.6%           0.33

Table XI. 16 Machines, Semi-Consistent, High Task, Low Machine Heterogeneity

Algorithm   System        Makespan       Improvement    Improvement    Running
            Utilization   (x10^5 Sec.)   over Max-min   over Min-min   Time (Sec.)
Max-min     [illegible]   1.862          -              -              1.21
Min-min     [illegible]   1.104          -              -              [illegible]
Smm-avg     [illegible]   1.058          [illegible]    4.4%           0.33
Smm-min     [illegible]   [illegible]    74.7%          3.5%           0.33
Smm-max     98.0%         1.067          [illegible]    3.4%           0.33

Table XII. 16 Machines, Semi-Consistent, High Task, High Machine Heterogeneity

Algorithm   System        Makespan       Improvement    Improvement    Running
            Utilization   (x10^6 Sec.)   over Max-min   over Min-min   Time (Sec.)
Max-min     [illegible]   [illegible]    -              -              1.21
Min-min     [illegible]   4.882          -              -              1.07
Smm-avg     96.9%         4.619          [illegible]    5.7%           0.33
Smm-min     96.6%         4.693          99.7%          4.0%           0.33
Smm-max     96.5%         4.673          100.5%         4.7%           0.33

Table XIII. 32 Machines, Inconsistent, Low Task, Low Machine Heterogeneity

Algorithm   System        Makespan       Improvement    Improvement    Running
            Utilization   (x10^3 Sec.)   over Max-min   over Min-min   Time (Sec.)
Max-min     [illegible]   1.954          -              -              2.23
Min-min     [illegible]   1.294          -              -              2.16
Smm-avg     93.1%         1.199          63.0%          7.9%           1.16
Smm-min     [illegible]   1.188          64.5%          8.9%           [illegible]
Smm-max     [illegible]   1.206          [illegible]    7.3%           [illegible]

Table XIV. 32 Machines, Inconsistent, Low Task, High Machine Heterogeneity

Algorithm   System        Makespan       Improvement    Improvement    Running
            Utilization   (x10^4 Sec.)   over Max-min   over Min-min   Time (Sec.)
Max-min     98.8%         6.395          -              -              2.24
Min-min     68.0%         3.959          -              -              2.16
Smm-avg     [illegible]   3.523          81.5%          12.4%          1.16
Smm-min     [illegible]   3.678          73.9%          7.6%           1.10
Smm-max     82.1%         3.502          82.6%          [illegible]    1.10

Table XV. 32 Machines, Inconsistent, High Task, Low Machine Heterogeneity

Algorithm   System        Makespan       Improvement    Improvement    Running
            Utilization   (x10^4 Sec.)   over Max-min   over Min-min   Time (Sec.)
Max-min     99.5%         5.755          -              -              [illegible]
Min-min     85.2%         3.804          -              -              2.16
Smm-avg     93.2%         3.525          63.3%          7.9%           [illegible]
Smm-min     93.9%         3.498          64.5%          [illegible]    [illegible]
Smm-max     92.4%         3.556          61.8%          7.0%           1.10

Table XVI. 32 Machines, Inconsistent, High Task, High Machine Heterogeneity

Algorithm   System        Makespan       Improvement    Improvement    Running
            Utilization   (x10^6 Sec.)   over Max-min   over Min-min   Time (Sec.)
Max-min     [illegible]   1.882          -              -              [illegible]
Min-min     [illegible]   [illegible]    -              -              [illegible]
Smm-avg     81.7%         [illegible]    74.3%          12.4%          1.16
Smm-min     [illegible]   [illegible]    [illegible]    8.2%           [illegible]
Smm-max     [illegible]   [illegible]    [illegible]    11.8%          1.09

Table XVII. 32 Machines, Consistent, Low Task, Low Machine Heterogeneity

Algorithm   System        Makespan       Improvement    Improvement    Running
            Utilization   (x10^3 Sec.)   over Max-min   over Min-min   Time (Sec.)
Max-min     99.8%         [illegible]    -              -              2.26
Min-min     88.4%         3.129          -              -              2.18
Smm-avg     94.8%         2.982          17.4%          4.9%           1.17
Smm-min     93.4%         3.025          15.8%          3.4%           1.09
Smm-max     94.1%         3.005          16.5%          4.1%           1.09

Table XVIII. 32 Machines, Consistent, Low Task, High Machine Heterogeneity

Algorithm   System        Makespan       Improvement    Improvement    Running
            Utilization   (x10^5 Sec.)   over Max-min   over Min-min   Time (Sec.)
Max-min     99.7%         1.707          -              -              2.27
Min-min     76.4%         1.296          -              -              2.18
Smm-avg     89.3%         1.245          37.1%          4.1%           1.17
Smm-min     87.0%         1.279          33.5%          1.3%           [illegible]
Smm-max     87.9%         1.260          35.5%          2.9%           1.09

Table XIX. 32 Machines, Consistent, High Task, Low Machine Heterogeneity

Algorithm   System        Makespan       Improvement    Improvement    Running
            Utilization   (x10^4 Sec.)   over Max-min   over Min-min   Time (Sec.)
Max-min     99.9%         10.305         -              -              2.27
Min-min     88.5%         9.196          -              -              2.19
Smm-avg     94.8%         8.775          17.4%          4.8%           1.18
Smm-min     93.6%         8.887          16.6%          3.5%           1.10
Smm-max     94.1%         8.849          16.5%          3.9%           [illegible]

Table XX. 32 Machines, Consistent, High Task, High Machine Heterogeneity

Algorithm   System        Makespan       Improvement    Improvement    Running
            Utilization   (x10^6 Sec.)   over Max-min   over Min-min   Time (Sec.)
Max-min     99.8%         5.016          -              -              2.26
Min-min     76.5%         3.814          -              -              2.18
Smm-avg     89.3%         3.668          36.8%          4.0%           1.18
Smm-min     87.0%         3.768          33.1%          1.2%           1.09
Smm-max     87.9%         3.717          34.9%          2.6%           1.09

Table XXI. 32 Machines, Semi-Consistent, Low Task, Low Machine Heterogeneity

Algorithm   System        Makespan       Improvement    Improvement    Running
            Utilization   (x10^3 Sec.)   over Max-min   over Min-min   Time (Sec.)
Max-min     [illegible]   [illegible]    -              -              2.23
Min-min     [illegible]   1.773          -              -              2.19
Smm-avg     [illegible]   1.674          54.5%          5.9%           1.17
Smm-min     [illegible]   1.679          [illegible]    5.6%           1.09
Smm-max     [illegible]   1.683          53.7%          5.3%           1.09

Table XXII. 32 Machines, Semi-Consistent, Low Task, High Machine Heterogeneity

Algorithm   System        Makespan       Improvement    Improvement    Running
            Utilization   (x10^4 Sec.)   over Max-min   over Min-min   Time (Sec.)
Max-min     99.3%         10.230         -              -              2.22
Min-min     66.4%         6.121          -              -              2.20
Smm-avg     [illegible]   [illegible]    82.5%          9.2%           1.19
Smm-min     [illegible]   5.714          [illegible]    7.1%           [illegible]
Smm-max     [illegible]   5.682          [illegible]    7.7%           [illegible]

Table XXIII. 32 Machines, Semi-Consistent, High Task, Low Machine Heterogeneity

Algorithm   System        Makespan       Improvement    Improvement    Running
            Utilization   (x10^4 Sec.)   over Max-min   over Min-min   Time (Sec.)
Max-min     99.7%         [illegible]    -              -              2.23
Min-min     85.1%         5.226          -              -              2.20
Smm-avg     [illegible]   4.925          54.3%          6.1%           1.19
Smm-min     [illegible]   4.937          54.2%          5.9%           1.10
Smm-max     92.1%         4.967          53.1%          5.2%           1.10

Table XXIV. 32 Machines, Semi-Consistent, High Task, High Machine Heterogeneity

Algorithm   System        Makespan       Improvement    Improvement    Running
            Utilization   (x10^6 Sec.)   over Max-min   over Min-min   Time (Sec.)
Max-min     99.3%         3.012          -              -              [illegible]
Min-min     66.3%         1.797          -              -              2.19
Smm-avg     84.1%         1.645          83.1%          9.2%           1.18
Smm-min     84.5%         1.682          79.0%          6.8%           [illegible]
Smm-max     [illegible]   1.674          80.0%          7.3%           1.10

5. Concluding Remarks

The Segmented min-min algorithm starts from a set of large tasks, while Min-min starts from small tasks. Smm balances the load very well and runs faster. In the near future, we will compare it with the Genetic algorithm, which delivered the best performance among eleven selected algorithms.